Produced by the NASA Center for Aerospace Information (CASI) 





SOFTWARE ENGINEERING LABORATORY SEL-83-008 




(NASA-IH-85445) PROCEEDINGS OF THE EIGHTH
ANNUAL SOFTWARE ENGINEERING WORKSHOP (NASA)
326 p HC A15/H? A31 CSCL 09R

G3/61 N84-23137 THRU N84-23149 Unclas 19376


PROCEEDINGS OF THE 
EIGHTH ANNUAL SOFTWARE 
ENGINEERING WORKSHOP 


NASA

National Aeronautics and 
Space Administration

Goddard Space Flight Center 

Greenbelt, Maryland 20771 


NOVEMBER 1983 





PROCEEDINGS 


OF 

EIGHTH ANNUAL SOFTWARE ENGINEERING WORKSHOP 


Organized by: 

Software Engineering Laboratory 
GSFC 


November 30, 1983 


GODDARD SPACE FLIGHT CENTER 
Greenbelt, Maryland 




FOREWORD 


The Software Engineering Laboratory (SEL) is an organization 
sponsored by the National Aeronautics and Space Administra- 
tion Goddard Space Flight Center (NASA/GSFC) and created for 
the purpose of investigating the effectiveness of software 
engineering technologies when applied to the development of
applications software. The SEL was created in 1977 and has
three primary organizational members:

NASA/GSFC (Systems Development and Analysis Branch) 

The University of Maryland (Computer Sciences Department) 

Computer Sciences Corporation (Flight Systems Operation) 

The goals of the SEL are (1) to understand the software de- 
velopment process in the GSFC environment; (2) to measure 
the effect of various methodologies, tools, and models on 
this process; and (3) to identify and then to apply success- 
ful development practices. The activities, findings, and 
recommendations of the SEL are recorded in the Software En- 
gineering Laboratory Series, a continuing series of reports 
that includes this document. 

Single copies of this document can be obtained by writing to

Frank E. McGarry
Code 582.1
NASA/GSFC
Greenbelt, Maryland 20771




EIGHTH ANNUAL SOFTWARE ENGINEERING WORKSHOP 
ABOUT THE WORKSHOP 


The Eighth Annual Software Engineering Workshop was held on November 30,
1983 at NASA/Goddard Space Flight Center in Greenbelt, MD. Once again,
the attendance approached 250 persons representing 5 universities, 23
agencies of the federal government, and 44 private companies.

The four major topics of discussion were (1) the NASA Software
Engineering Laboratory, (2) software testing, (3) human factors in
software engineering, and (4) software quality assessment. As in past
years, 12 position papers were presented (3 for each topic),
followed by questions and very heavy participation by the general
audience.

The workshop is organized by the Software Engineering Laboratory (SEL), 
whose members represent the NASA/GSFC, University of Maryland, and 
Computer Sciences Corporation (CSC). The meeting has been an annual 
event for the past 8 years (1976 to 1983), and there are plans to
continue it as long as it is felt to be productive.

This record of the meeting is generated by the SEL and is printed and 
distributed by the Goddard Space Flight Center. All persons who are 
registered on the mail list of the SEL receive a copy at no charge. 

Additional information about the workshop or about the SEL may be 
obtained by contacting: 

Frank E. McGarry 
NASA/GSFC
Code 582

Greenbelt, MD 20771 


301-344-6846 


EIGHTH ANNUAL SOFTWARE ENGINEERING WORKSHOP 
NASA/GODDARD SPACE FLIGHT CENTER 
BUILDING 3 AUDITORIUM 
NOVEMBER 30, 1983 


8:00 a.m. 

Registration — ‘Sign In’ 
Coffee Donuts 


8:45 a.m. 

INTRODUCTORY REMARKS 

J. J. Quann, Deputy Director
(NASA/GSFC) 

9:00 a.m. 

Session No. 1 

Topic: Current Research in the Software 
Engineering Laboratory (SEL) 



Discussant: F. E. McGarry (NASA/GSFC) 


“Evaluating Software Engineering
Technologies in the SEL” 

D. Card (CSC) 


“Dynamic Metrics for Software 
Management” 

V. Basili (University of MD) 


“Characteristics of a Rapid 
Prototyping Experiment” 

M. Zelkowitz (University of MD) 

10:30 a.m. 

BREAK 


11:00 a.m.

Session No. 2 

Topic: Testing Software 



Discussant: J. Page (CSC)


“Structural Coverage of 
Functional Testing” 

J. Ramsey (University of MD)


“A Methodology for Detecting 
Errors” 

A. Goel (Syracuse University) 


“Testing and Error Analysis of 
a Real-Time Controller” 

C. Savolaine (Bell Labs) 


12:30 p.m.

LUNCH


1:30 p.m.

Session No. 3

Topic: Human Factors

Discussant: V. Basili (University of MD)


“Transformations of Software
Design and Code May Lead to
Reduced Errors”

E. Connelly (PMA, Inc.)


“You Can Observe a Lot by Just
Watching How Designers Design”

E. Soloway (Yale)


“Evaluating Multiple Coordinated
Windows for Programmer
Workstations”

C. Grantham (University of MD)

3:00 p.m.

BREAK


3:30 p.m.

Session No. 4

Topic: Quality Assessment

Discussant: W. Agresti (CSC)


“Cleanroom Certification
Model”

P. Currit (IBM)


“Projecting Manpower to
Attain Quality”

K. Rone (IBM)


“An Approach to Software
Baseline Generation”

J. Romeu (IITRI)

5:00 p.m.

ADJOURN


SUMMARY OF THE SESSIONS: EIGHTH ANNUAL SOFTWARE 
ENGINEERING WORKSHOP 


Prepared for the 
NASA/GSFC 

EIGHTH ANNUAL SOFTWARE ENGINEERING WORKSHOP 

by 

Thomas A. Babst 

COMPUTER SCIENCES CORPORATION 
and 

THE GODDARD SPACE FLIGHT CENTER 
SOFTWARE ENGINEERING LABORATORY 



INTRODUCTORY REMARKS 


John J. Quann, Deputy Director, Goddard Space Flight Center
(GSFC), made the opening remarks at GSFC's Eighth Annual 
Software Engineering Workshop. He stressed the importance 
of Software Engineering Laboratory (SEL) activities to GSFC 
and pointed out the effect of this work on the Spacelab 
project and its relevance to future projects such as the 
Space Telescope and Space Station. 


The Space Station, for example, will require all NASA cen- 
ters to work together in a disciplined manner. NASA will be 
studying the results of SEL research to identify strategies 
for the design, implementation, testing, and interfacing of 
the software system. Mr. Quann also emphasized the impor- 
tance of conferences, such as this one, as opportunities for 
the exchange of ideas among managers, developers, and acad- 
emicians. This is the route to excellence in the field of 
software engineering.


SESSION 1 - CURRENT RESEARCH IN THE SOFTWARE
ENGINEERING LABORATORY



Frank McGarry --"The Software Engineering Laboratory" 

Frank McGarry of GSFC summarized the efforts of the Software 
Engineering Laboratory over the past year. Mr. McGarry ex- 
plained that the SEL is a consortium that also includes 
Computer Sciences Corporation and the University of Maryland. 
The SEL has concentrated its efforts in four major areas of
software engineering research: software reliability and
testing, technology evaluation, software measures, and
software development management.

Many experiments have been performed by the SEL on produc- 
tion projects to evaluate software development technologies 
and to test software engineering theories. The results of 
some of these activities are being presented at this work- 
shop. One of the principal areas of future activities will 
be the development of a software management environment to 
provide managers with the tools necessary to monitor and 
control the software development process. 


T. Babst


Dave Card (Computer Sciences Corporation) --"Evaluating
Software Engineering Technologies in the SEL"


Mr. Card's presentation described the results of a study 
that measured the effects of some software engineering prac- 
tices, tools, and techniques on productivity and reliability 
in a production environment. The study was based on a 
sample of 22 similar software systems selected from the SEL 
data base. Eight widely used and accepted technologies were
evaluated: quality assurance, software tools, documentation,
code reading, top-down development, chief programmer
team, structured coding, and design time. A statistical
technique was employed to compensate for the effects of
nontechnological factors such as programmer effectiveness and
computer use.


The study concluded that none of the individual technologies 
evaluated had a significant effect on productivity during 
development; however, reliability was increased signifi- 
cantly by quality assurance, documentation, and code read- 
ing. A 30-percent improvement was achieved with these 
technologies, and other benefits may also be obtained. In 
particular, a reduction of maintenance costs seems probable. 


In response to questions and comments from the audience,
Mr. Card clarified the following points:

• All systems studied passed their acceptance tests;
thus, the quality of each was at least "good."


• The measure of programmer effectiveness used was a 
weighted measure of years of experience. 

• Productivity was measured during development. That 
is, it is based on the cost to deliver the system 
to the customer. Subsequent maintenance costs are 
not included. 




Victor Basili (University of Maryland) --"Dynamic Metrics for 
Software Management" 

Dr. Basili's presentation described several efforts related
to the development of a general methodology for monitoring 
software development for the early detection of problems. A 
pilot study, tool implementation, and extension activities 
were discussed. 

The approach of the pilot study was to develop a series of 
baselines for critical measures. The actual values realized
by a project under development can be compared with the base- 
lines to detect significant deviations. A set of explana- 
tions was defined for each type of deviation, and the 
methodology provided a mechanism for rating the probability 
of these explanations. In the pilot study, data from eight 
projects formed the baseline, and one other project was com- 
pared with them. 
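The baseline comparison described above can be sketched roughly as follows. This is only an illustration, not the SEL methodology itself: the summary does not say how a "significant deviation" was defined, so the mean and standard deviation baseline, the two-standard-deviation threshold, and the function names `build_baseline` and `flag_deviations` are all assumptions.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize past projects at each checkpoint.

    history maps a percent-complete checkpoint to the list of values that
    past projects showed at that point (hypothetical input layout).
    Returns {checkpoint: (mean, sample standard deviation)}.
    """
    return {pct: (mean(vals), stdev(vals)) for pct, vals in history.items()}


def flag_deviations(baseline, actual, threshold=2.0):
    """Return checkpoints where the project under development lies more
    than `threshold` standard deviations from the baseline mean.
    The threshold is an assumed convention, not taken from the source."""
    flagged = []
    for pct, value in sorted(actual.items()):
        avg, sd = baseline[pct]
        if sd > 0 and abs(value - avg) / sd > threshold:
            flagged.append(pct)
    return flagged
```

A manager would then review the flagged checkpoints against the candidate explanations for each type of deviation; that rating step is not modeled here.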

Dr. Basili indicated that future plans include extending the 
methodology to include additional measures and developing a 
knowledge-based system incorporating this methodology. The 
system will be developed using KMS (a software system used 
in constructing knowledge-based systems) at the University 
of Maryland. Dr. Basili stressed that this system is not 
intended to replace a manager's expert judgment but rather 
to support it with a formal tool. 

In response to questions and comments from the audience,
Dr. Basili clarified the following points:

• Measurement can be extended to the whole life 
cycle, and this option is under study. 

• The baselines are defined at discrete points
corresponding to specific percents of work completed.
In practice, it is difficult to determine the
percent completion of a project under development.
The best way to do this can be determined only by
studying the environment in which the methodology
is to be applied.

• Rate of change can be used as an indicator but is 
not in the current methodology. 

• Programmers in this environment do not appear to be 
changing their behavior to match the metrics. 

• The KMS knowledge-based system may be transport- 
able, but that is not an important consideration 
now. 









Marvin Zelkowitz (University of Maryland) --"Characteristics
of a Rapid Prototyping Experiment"

Dr. Zelkowitz discussed the issues of prototyping in the 
context of an actual prototype recently developed for GSFC. 
This prototype, the Flight Dynamics Analysis System (FDAS),
is currently under evaluation. 

FDAS is intended to provide an integrated software develop- 
ment environment for spacecraft attitude, orbit, and mission 
analysis research. It consists of a management system and a 
library of application software. The application software 
was implemented in an extended version of FORTRAN that pro- 
vides data abstraction and generalized input/output 
capabilities. 

Dr. Zelkowitz provided three definitions of a prototype: a
"quick and dirty" throwaway, a partial implementation, and
a first build. Some portions of a quick and dirty prototype
may be reused later in the final system. A prototype need 
not be cheap to be cost-effective if it enables the full 
system to be implemented less expensively and with greater 
reliability than it would have been without the prototype. 

In response to questions and comments from the audience,
Dr. Zelkowitz clarified the following points:

• The goal of FDAS was not to save code, although
much will probably be reused.

• Forty-eight percent of the development effort was 
spent in implementation. This phase includes cod- 
ing and unit testing activities. 

• The full FDAS system may be implemented in a lan- 
guage other than FORTRAN. 

• FDAS is in the public domain and will probably be 
made available through COSMIC when it is completed.


SESSION 2 - TESTING SOFTWARE


Jim Ramsey (University of Maryland) --"Structural Coverage of 
Functional Testing"

Mr. Ramsey described the initial results of an evaluation of 
the effectiveness of functional testing by examining struc- 
tural coverage metrics. A FORTRAN program consisting of 
68 subroutines was instrumented to produce structural cover- 
age measures when executed. Then, structural coverage data 
were collected by performing (functional) acceptance tests. 
These results were compared with data from operational use 
of the program. 

Mr. Ramsey reported that although the acceptance tests and 
operational use largely covered the same software, there 
were significant differences. Also, about one-third of the 
code was never executed. However, this procedure does have 
the potential for providing a numerical measure of the ef- 
fectiveness of (functional) acceptance tests. 
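The comparison Mr. Ramsey describes can be pictured as set arithmetic over instrumented program units. The sketch below is an assumption-laden illustration rather than the instrumentation used in the study: the unit names, the function `coverage_summary`, and the choice of subroutines as the unit of coverage are hypothetical.

```python
def coverage_summary(all_units, executed_in_tests, executed_in_operation):
    """Compare what acceptance tests exercised with what operational use
    exercised. Each argument is an iterable of unit identifiers, for
    example subroutine names (hypothetical granularity)."""
    all_units = set(all_units)
    tested = set(executed_in_tests) & all_units
    used = set(executed_in_operation) & all_units
    return {
        # Fraction of units reached by the functional acceptance tests.
        "test_coverage": len(tested) / len(all_units),
        # Fraction of units reached by neither tests nor operation.
        "never_executed": len(all_units - tested - used) / len(all_units),
        # Units the users reached that the tests did not.
        "used_but_untested": sorted(used - tested),
        # Units the tests reached that the users did not.
        "tested_but_unused": sorted(tested - used),
    }
```

The "used_but_untested" list is the part most relevant to judging test effectiveness, since it names code that ran in operation without having been exercised by the acceptance tests.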

A much larger piece of software is now in the process of 
being instrumented and tested in this manner. More concrete 
results should be derived from this additional data. 

In response to questions and comments from the audience,
Mr. Ramsey clarified the following points:

• Conclusions cannot be made at this time about 
whether larger or smaller modules are more fully 
exercised or about the nature of the untested code. 

• The tests performed were derived from the func- 
tional requirements of the program, not from knowl- 
edge of the code. 









Amrit Goel (Syracuse University) --"A Methodology for 
Detecting Errors" 

Dr. Goel described a mathematical approach to selecting
software tests. No testing strategy can detect all errors;
however, Error Specific Tests (ESTs) can be devised to isolate
those types of errors important to the tester.

In this approach, test requirements are formulated in alge- 
braic notation. Tests are determined from the requirements 
specification and its functional decomposition. Next, tests 
specific to each type of error targeted by the user are de- 
veloped and enumerated in a test plan. This process of de- 
fining functional requirements and structural parts may also 
provide insight into software complexity.

In response to questions and comments from the audience,
Dr. Goel clarified the following points:

• The methodology discussed has not been tested on
actual software development problems.

• Optimization of the test plan is necessary to avoid
redundant tests.

• Automation is essential because of the complexity
and comprehensiveness of the resulting test plan.

• This method of testing is different from program
proofs, although the notation is similar.


Cathy Savolaine (Bell Laboratories) --"Testing and Error
Analysis of a Real-Time Controller"

Ms. Savolaine reported the results of an error analysis 
based on data collected from the development and testing of 
a real-time communications controller system. The system 
studied was the Satellite Network Scheduler (SNS), which
controls ground stations as part of a reservation system for
picturephone conferencing. Testing for each release was
performed by an individual not involved in the development
of that release. The number of errors per module was
correlated with module size and cyclomatic complexity. Errors
were classified in three groups: omission, commission, and
requirements. Half of the errors detected before delivery
were errors of omission. In contrast, half of the errors 
found during operational usage were errors of commission. 
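As a rough illustration of the correlation step mentioned above, the following sketch computes a Pearson coefficient between per-module error counts and two module measures. The summary does not say which correlation statistic was used, so Pearson is an assumption, and the sample numbers are invented for illustration only; they are not data from the SNS study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.
    Assumes neither sequence is constant (otherwise the denominator is 0)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical per-module data (NOT from the SNS study).
errors = [1, 3, 2, 6, 4]
size_loc = [120, 300, 180, 650, 400]
complexity = [4, 9, 6, 21, 12]

print("errors vs. size:      ", round(pearson(errors, size_loc), 2))
print("errors vs. complexity:", round(pearson(errors, complexity), 2))
```

Note that, as stated in the discussion of this paper, the study compared the total number of errors (not an error rate) with size and complexity, which is what the sketch mirrors.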

Ms. Savolaine concluded from these results that complex 
modules should be avoided, more code inspections should be 
performed, and developers should look harder for commission 
errors because these were the principal type found by the 
user. 

In response to questions and comments, Ms. Savolaine clari- 
fied the following points: 

• Records were kept of the numbers of errors found 
during code inspection, but the data are not 
readily available. 

• The development cost of an automated test package 
was included in the SNS development budget. 

• Errors of commission were not further categorized, 
but this can be done. 







• It is not known at this time why fatal errors 
seemed to cluster in the simpler modules. 

• The total number of errors, not error rate, was 
compared with size and complexity.


SESSION 3 - HUMAN FACTORS


Ed Connelly (PMA) --"Transformation of Software Design and 
Code May Lead to Reduced Errors" 

Mr. Connelly described a series of experiments conducted to 
determine how well people can use examples to specify logic. 
In this study, individuals were asked to devise solution 
algorithms to various problems (specifically, scheduling and 
allocation problems).

The problems were initially given to accountants, and later 
to programmers. The solution algorithms were fed to an
inductive processor. Feedback from the processor helped to
systematize the subjects' thinking. The solution algorithms
were compared with FORTRAN programs, and both were tested 
for correctness. 

Based on the results of these experiments, Mr. Connelly con- 
cluded that performance is correlated with the number of 
languages and operating systems the programmer is familiar 
with. He also indicated that the examples had fewer errors 
of commission than FORTRAN code developed for the same 
problem. 

In response to questions and comments from the audience,
Mr. Connelly clarified the following point:

• The dependent variable in the analysis was performance
(i.e., the number of incorrect inputs recognized
by the program).


Elliot Soloway (Yale University) --"You Can Observe a Lot by
Just Watching How Designers Design"

Dr. Soloway described some observations made during a study 
of the work habits of novice and experienced software de- 
signers. The experts had 8 or more years of experience; the 
novices had 2 years or less; all were familiar with telecom- 
munications system software. 

Each individual was given the same vague set of specifica- 
tions for an electronic mail system and was asked to develop 
a design. The design process was recorded on videotape. An 
interviewer prompted the designers to describe what steps 
they were taking. The experts approached the problem
systematically in a top-down fashion. They kept detailed notes
of assumptions, constraints, and expectations. In contrast, 
the novices immediately began working on the problem at a 
very detailed level. 

One conclusion drawn by Dr. Soloway was that an effective 
design tool should provide a capability for keeping track of 
notes of the type made by the experts. Most such tools de- 
veloped in the past have focused on what the designer should 
be doing rather than on facilitating what he/she actually 
does.

In response to questions and comments from the audience,
Dr. Soloway clarified the following points:

• The expert designers were very individualistic. 

• The experts seemed to have some familiarity with 
the problem. It would be interesting to test them 
in other circumstances. 

• The experts were clearly designers, whereas the
novices could have been programmers who were asked
to design.




• The experts continued to "back up" if questions 
remained unanswered. It would be interesting to 
see and measure where this backup occurs. 

• It might be possible to build a system to teach 
novices to become experts in design. 

• The experts and novices were separated, but it 
might be interesting to see how they worked
together.

• The experiment was exploratory in nature, rather 
than a rigorous test of any hypothesis.


Charles Grantham (University of Maryland) --"Evaluating
Multiple Coordinated Windows for Programming
Workstations"

Dr. Grantham described the results of some recent research 
on the design of a multiple-screen programmer workstation. 

Two such workstation designs are under evaluation. One sta- 
tion consists of three separate screens; the other consists 
of one screen with four windows. The information on each 
screen or window is coordinated with the others. The appro- 
priate information to be displayed on each window was deter- 
mined by observing the behavior of programmers while 
testing, debugging, and modifying software. The module 
specification, structure chart, and source listing are dis- 
played under both configurations. The four-window configu- 
ration has an additional user-defined area. Ultimately, 
better workstation designs should improve the software de- 
velopment process by maximizing the number of tools that are 
available to the programmer at one time. 

In response to questions and comments from the audience,
Dr. Shneiderman and Dr. Grantham clarified the following
points:

• Many multiple-screen systems do exist, but most are 
passive displays that do not have coordinated 
screen action. This study addresses dynamic screen 
coordination. 

• Software maintenance will be facilitated by using 
multiple screens in this manner, because additional 
details about the module being maintained will be 
available.

• The importance of left/right orientation should be
considered when selecting and arranging display
contents.




• The layout of information in different screens or
windows was essentially fixed (not dynamically
controlled by the user).


SESSION 4 - QUALITY ASSESSMENT


Al Currit (IBM Corporation) --"Cleanroom Certification Model"

Mr. Currit described the software reliability model used for 
software certification in the "cleanroom" development ap- 
proach. The cleanroom is a rigorous methodology that sepa- 
rates developers from all testing activities. It replaces 
unit and integration testing with rigorous code inspections. 
Although it is difficult to produce software with zero de- 
fects, it is hoped that this approach will produce code with 
a very low probability of failure. 

Certification of the developed code is dependent on its 
achieving a specified mean time to failure (MTTF) during
testing. MTTF is an appropriate measure because it is unam- 
biguous and relates to the customer's needs. The certifica- 
tion model predicts MTTF based on failure data collected 
during testing. It shows good agreement with published data. 
Although mathematically similar to some popular reliability 
models, it is simpler than most. This MTTF model seems to 
be an effective tool for determining when software is ready 
for delivery. 
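The certification decision described above can be illustrated with the simplest possible estimator. The summary does not give the form of the cleanroom certification model, so the sketch below is NOT that model: it substitutes a plain point estimate (total observed usage time divided by the number of failures), and the function names and the delivery rule are assumptions made for illustration.

```python
def observed_mttf(interfailure_times):
    """Point estimate of MTTF: total observed usage time divided by the
    number of failures. Times are in whatever usage unit the test log
    records (the discussion mentions usage months)."""
    if not interfailure_times:
        raise ValueError("at least one failure interval is required")
    return sum(interfailure_times) / len(interfailure_times)


def ready_for_delivery(interfailure_times, target_mttf):
    """Assumed decision rule: certify when the estimate meets the target."""
    return observed_mttf(interfailure_times) >= target_mttf
```

A predictive model such as the one Mr. Currit presented would also project how MTTF changes as defects are corrected, which a simple average cannot do.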

In response to questions and comments from the audience,
Mr. Currit clarified the following points:

• MTTF is measured in terms of usage months rather 
than CPU execution time. 

• The cleanroom concept replaces unit testing with
statistical testing. Test data are used to calculate
MTTF.

• Under the cleanroom system, programmers are kept 
away from the computer as much as possible. They 
only get clean compiles of their code and are not 
able to debug programs on the computer. 


Kyle Rone (IBM Corporation) --"Projecting Manpower To Attain
Quality"

Mr. Rone described the derivation of a model to predict the 
manpower required to insert new technology into a system. 
This model will also aid in defining the distribution of 
manpower needed to achieve maximum quality. 

The development environment studied generates software in 
increments, as a series of releases. The goal of this re- 
search effort is to create a model that matches this strat- 
egy. Increasing the manpower at the beginning of a project 
and moving more quality analysis toward the front seems to 
facilitate the early detection of errors. Mr. Rone believes
that by following this plan, maintenance costs for the sys- 
tem studied, which annually are now approximately 25 percent 
of the development cost, will be reduced to around 15 or 
20 percent. 

In response to questions and comments from the audience,
Mr. Rone clarified the following point:

• Maintenance includes the effort required to fix
errors documented on discrepancy reports. It does
not include the effort spent to complete change
requests.


Jorge Romeu (IIT Research Institute) --"An Approach to
Software Baseline Generation"

Dr. Romeu discussed the initial results of an ongoing re- 
search effort to define baselines for the management of 
software development. A baseline was defined to be an esti- 
mate of the usual value of any characteristic of a software 
system. 

The analysis was based on data collected by the Software 
Engineering Laboratory. Correlations were calculated be- 
tween effort and other software characteristics, and de- 
scriptive statistics were generated. The ultimate goal of 
this research is to develop guidelines for estimating costs 
and performance characteristics for software development 
based on historical data. The baseline approach is widely 
applicable and easily implemented.


INTRODUCTION


The basic goal of software engineering is to produce the 
best possible software at the lowest possible cost. Many 
practices, tools, and techniques (collectively referred to 
as technologies) have been developed that purport to help do 
this, some of which have become widely accepted in the soft- 
ware industry. However, few of these technologies have been 
effectively evaluated experimentally (Reference 1). This is
due in large part to an insufficient understanding of the 
software development process, a lack of recognized standards 
for measurement, and the prohibitive cost of large-scale 
controlled experiments. The analysis described in this 
paper addresses some of these issues. The specific objec- 
tives of this study were to 


• Measure technology use in a production environment 

• Develop a model for evaluating software engineering 
technologies 

• Evaluate the effects on productivity and reliabil- 
ity of some specific technologies 

Eight widely used technologies were selected for study, as 
identified in Table 1. The extent of general use shown in 
Table 1 is the percent of respondents reporting having suc- 
cessfully applied these technologies in a survey by Beck and 
Perkins (Reference 2). 

The data analyzed in this study were collected by the
Software Engineering Laboratory (SEL). The SEL has collected
data from more than 45 projects during the past 6 years 
(Reference 3). Table 2 shows some of the characteristics of 
these projects. Although a controlled experiment was not 
performed for this study, a carefully matched sample was 
selected for analysis from the SEL data base. The sample
consisted of 22 scientific software systems developed in
FORTRAN on the same computers to support spacecraft flight
dynamics applications.


TABLE 1. TECHNOLOGY INDICES

INDEX                     SEL MEDIAN (%)    GENERAL USE (%)¹

QUALITY ASSURANCE²              49                49
TOOL USE²                       49                NA
DOCUMENTATION²                  82                78
STRUCTURED CODE                 70                59
CODE READ                       20                44
TOP-DOWN DEVELOPMENT            60                60
CHIEF PROGRAMMER                85                46
DESIGN TIME                     32                NA

¹ FROM SURVEY BY BECK & PERKINS.
² COMPOSITE OF SEVERAL ITEMS.


TABLE 2. ENVIRONMENT STUDIED

TYPE OF SOFTWARE:  SCIENTIFIC, GROUND-BASED, INTERACTIVE GRAPHIC,
                   MODERATE RELIABILITY AND RESPONSE REQUIREMENTS
LANGUAGES:         85% FORTRAN, 15% ASSEMBLER MACROS
MACHINES:          IBM S/360 AND 4341, BATCH WITH TSO

PROJECT CHARACTERISTICS:      AVERAGE     HIGH      LOW

DURATION (MONTHS)               15.6      20.5     12.9
EFFORT (STAFF-YEARS)             8.0      11.5      2.4
SIZE (1000 LOC)
  DEVELOPED                     57.0     111.3     21.5
  DELIVERED                     62.0     112.0     32.8
STAFF (FULL-TIME EQUIV.)
  AVERAGE                        5.4       6.0      1.9
  PEAK                          10.0      13.9      3.8
  INDIVIDUALS                     14        17        7
APPLICATION EXPERIENCE
  MANAGERS                       5.8       6.5      5.0
  TECHNICAL STAFF                4.0       5.0      2.9
OVERALL EXPERIENCE
  MANAGERS                      10.0      14.0      8.4
  TECHNICAL STAFF                8.5      11.0      7.0

SAMPLE: 22 SYSTEMS USING A VARIETY OF TECHNOLOGIES


TECHNOLOGY MEASUREMENT


A degree-of-use score (technology index) was determined for 
each of the technologies listed in Table 1 for every system 
in our sample. These scores are based on both subjective 
and objective information. (The table lists the median 
score from the sample of 22 projects.) These scores are the 
percentage of actual use of a technology relative to its 
maximum possible use. The exception is design time, which 
is simply the percentage of the development schedule spent 
in design. 


For those technology indices having only one component (see 
Table 1), such as code reading, the score is the percentage 
of code to which this technology was applied. For those 
technology indices having more than one component, such as 
documentation, the score is the percentage of components 
applied. In the case of the documentation technology index, 
the score is the percentage of documents actually produced 
by a project of those that might be produced in this 
environment. 
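The two scoring rules just described can be sketched as below. This is a minimal illustration under stated assumptions: the function names and the document names in the usage example are hypothetical, and the subjective judgments that also fed the actual scores are not modeled.

```python
def single_component_index(loc_with_technology, total_loc):
    """Score for a one-component technology (e.g., code reading):
    percentage of code to which the technology was applied."""
    return 100.0 * loc_with_technology / total_loc


def composite_index(components_applied, components_possible):
    """Score for a multi-component technology (e.g., documentation):
    percentage of the possible components that were actually applied."""
    possible = set(components_possible)
    applied = set(components_applied) & possible
    return 100.0 * len(applied) / len(possible)
```

For example, a project that read 2,000 of its 10,000 lines would score 20 on code reading, and a project that produced two of four possible documents would score 50 on documentation.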


This analysis attempted to identify the effects of tech- 
nology use on development team productivity and software 
reliability. Productivity was measured in terms of the 
number of noncomment lines of code designed, coded, and 
tested per programmer hour of effort. Reliability was 
measured as the inverse of the number of errors detected per 
noncomment line of code. 
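Written out, the two measures defined above are simple ratios. The sketch below restates them in code; the function and parameter names are assumptions, and the handling of a project with zero detected errors (returning infinity) is a choice made here, not something the paper specifies.

```python
def productivity(noncomment_loc, programmer_hours):
    """Noncomment lines of code designed, coded, and tested per
    programmer hour of effort."""
    return noncomment_loc / programmer_hours


def reliability(noncomment_loc, errors_detected):
    """Inverse of the error rate (errors per noncomment line), which is
    the same as noncomment lines per detected error."""
    if errors_detected == 0:
        return float("inf")  # assumed convention for error-free systems
    return 1.0 / (errors_detected / noncomment_loc)
```

Under these definitions, larger values are better for both measures, which keeps the direction of the later high/low comparisons consistent.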


One assumption made in this analysis is that the effect of 
any technology is incremental. That is, a high level of use 
of a beneficial technology has more effect than a low level 
of use. A technology that is of no value unless applied 
perfectly is of no value at all, because it will never be 
applied perfectly. 
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TECHNOLOGY EVALUATION 


Evaluating the effect of a technology on an actual software 
development project is not easy. In practice, several
technologies may be applied together, and other factors such as
programmer effectiveness and problem complexity also
influence project results. Boehm (Reference 4) has pointed out
the difficulty of distinguishing the effects of modern
programming practices from those of related factors. Table 3
lists the nontechnology factors considered in this analysis.
All of these have been suggested in the software engineering 
literature to affect productivity and/or reliability. 

Thus, the next step of this analysis was to identify the 
major nontechnology factors and to develop a procedure for 
compensating for their effects on productivity and reliabil- 
ity. The analysis of covariance technique (Reference 5) was 
selected to deal with this situation. The Statistical 
Analysis System (Reference 6) software performed the 
computations reported in this paper. 

The technology indices were collapsed for this analysis by
dividing the projects into "high" and "low" groups with re- 
spect to each technology index. Although this causes some 
loss of information, the resulting analysis is also more 
robust. This analytic technique permitted tests of signifi- 
cance to be performed between the high and low groups with 
respect to productivity and reliability after compensating 
for the nontechnology factors (covariates).
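
A minimal sketch of the high/low grouping follows. The text does not state where the cut was placed; a split at the sample median is assumed here, and the project labels and scores are hypothetical.

```python
from statistics import median


def split_high_low(index_by_project):
    """Divide projects into high and low groups at the median index score.

    The median cut point is an assumption; the source only says the
    projects were divided into "high" and "low" groups.
    """
    cut = median(index_by_project.values())
    high = sorted(p for p, score in index_by_project.items() if score > cut)
    low = sorted(p for p, score in index_by_project.items() if score <= cut)
    return high, low


# Hypothetical code reading scores (percent) for four projects.
scores = {"A": 10, "B": 35, "C": 60, "D": 85}
high, low = split_high_low(scores)
print(high, low)
```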

The two most highly correlated factors from Table 3 were 
initially selected as covariates for productivity and reli- 
ability. Programmer effectiveness and computer use were 
selected as covariates with productivity. Programmer 
effectiveness was also selected as a covariate with reli- 
ability. However, because requirements changes was cor- 
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TABLE 3. OTHER FACTORS 


                                                        CORRELATIONS
FACTOR                                   MEAN    PRODUCTIVITY(1)  RELIABILITY(2)

PRODUCTIVITY                             3.0         --               0.51
PROGRAMMER EFFECTIVENESS (WEIGHTED YEARS) 5.7        0.53+            0.68*
REQUIREMENTS CHANGES/SUBSYSTEMS          1.4        -0.12            -0.40
NUMBER OF SUBSYSTEMS                     6           0.21             0.03
NUMBER OF DATA SETS                      11          0.26             0.17
NUMBER OF DATA ITEMS                     328         0.30             0.21
AVERAGE STAFF LEVEL (FTE)                3.3         0.10            -0.09
AVERAGE MODULE SIZE (NEW)                193        -0.07            -0.15
COMPUTER USE (HOURS/LOC)                 0.008      -0.59*           -0.19
MANAGEMENT/SUPPORT EFFORT (%)            19         -0.47            -0.18
DATA DENSITY (DATA ITEMS/SUBSYSTEM)      71         -0.07             0.38+

(1) PRODUCTIVITY = DEVELOPED NONCOMMENT LINES OF CODE/PROGRAMMER HOURS
(2) RELIABILITY = -ERRORS/DEVELOPED NONCOMMENT LINES OF CODE
+ SECOND FACTOR SELECTED.
* FIRST FACTOR SELECTED.
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related with programmer effectiveness, data density was 
substituted as the second covariate for reliability. This
prevented collinearity in the model.

Each technology was evaluated independently in this manner.
One potential confounding effect recognized in an earlier 
SEL study (Reference 7) and by Boehm (Reference 4) was the 
tendency of technologies to be used together. This makes it 
difficult to isolate the effects of one technology from 
another and poses the possibility that there might be an 
interaction of technologies that this procedure could not 
detect. 

Productivity Results 

This approach to the evaluation of technologies resulted in 
the generation of a class of models (one for each tech- 
nology) of the form 

Productivity = Technology + Programmer Effectiveness 
+ Computer Use 
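
One way to write this form as an explicit linear model is sketched below. The symbols are assumed for illustration and do not appear in the original analysis; the sketch also assumes the usual analysis of covariance condition that the covariate slopes are the same in the high and low groups.

```latex
P_i = \beta_0 + \tau\,T_i + \beta_1 E_i + \beta_2 C_i + \varepsilon_i,
\qquad T_i \in \{0, 1\}
```

Here $P_i$ is the productivity of project $i$, $T_i$ marks membership in the high (1) or low (0) group for the technology under test, $E_i$ is programmer effectiveness, and $C_i$ is computer use. Under this reading, the test of significance asks whether $\tau$ differs from zero once the two covariates have been accounted for.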

Together, programmer effectiveness and computer use ac- 
counted for 54 percent of the variation in productivity 
before the effects of any technologies were included in the 
models. Table 4 shows the additional variation accounted 
for by the technology factors. The magnitude and signifi- 
cance of the effect for each technology are also listed in 
the table. Individually, none of the technologies studied 
in this analysis showed a significant effect on productivity.
However, this also indicates that any other benefits
derived from these technologies are not at the expense of 
productivity. 

Early suggestions were that the principal value of modern
programming practices lies in the area of maintainability.
Shephard (Reference 8) indicated that the effects
of such technologies are more apparent in less experienced 
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TABLE 4. SUMMARY OF 
PRODUCTIVITY ANALYSES 


TECHNOLOGY              SIGNIFICANCE       PERCENT        EXPLANATORY
INDEX (EFFECT)          OF EFFECT (X 1)    IMPROVEMENT    CONTRIBUTION (X 2)

QUALITY ASSURANCE       0.87                -2            0
TOOL USE                0.77                 3            0
DOCUMENTATION           0.36                11            2
STRUCTURED CODE         0.82                -2            0
TOP-DOWN DEVELOPMENT    0.95                -1            0
CODE READ               0.45                 8            1
CHIEF PROGRAMMER        0.16               -16            5
DESIGN TIME             0.60                 7            1


ISOLATED TECHNOLOGIES HAVE NO DETECTABLE EFFECT ON PRODUCTIVITY
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programmers than in experienced personnel such as those 
studied by the SEL (see Table 2). Some other
environment-specific considerations are discussed in the summary at the
end of this section. Mills (Reference 9) proposed that
productivity is a byproduct of quality, that is, a consequence
of minimizing rework (errors). We would thus expect
differences in reliability (quality) to be easier to detect.

Reliability Results

This approach to the evaluation of technologies resulted in 
the generation of a class of models (one for each tech- 
nology) of the form 

Reliability = Technology + Programmer Effectiveness 
+ Data Density 

Together, programmer effectiveness and data density ac- 
counted for 63 percent of the variation in reliability be- 
fore the effects of any technologies were included in the 
models. Table 5 shows the additional variation accounted 
for by the technology factors. The magnitude and signifi- 
cance of the effect for each technology are also listed in 
the table. 

Three of the technologies studied in this analysis showed
significant effects on reliability: quality assurance,
documentation, and code reading. All of these techniques
are examples of conscious efforts to understand and verify
the software product. Approximately 73 percent of the
variation in reliability can be explained with a model of this
type. Improvements in reliability were obtained without any
apparent effect on productivity (Table 4). Furthermore,
this implies that skimping on these activities will not
produce any cost savings for the developer.






D. Card 
CSC 
11 of 17 


TABLE 5. SUMMARY OF 
RELIABILITY ANALYSES 


TECHNOLOGY              SIGNIFICANCE       PERCENT        EXPLANATORY
INDEX (EFFECT)          OF EFFECT (X 1)    IMPROVEMENT    CONTRIBUTION (X 2)

QUALITY ASSURANCE       0.02*               29            10
TOOL USE                0.78                 3             1
DOCUMENTATION           0.04*               27             8
STRUCTURED CODE         0.75                 3             1
TOP-DOWN DEVELOPMENT    0.67                 6             1
CODE READ               0.02*               29            10
CHIEF PROGRAMMER        0.56                 8             1
DESIGN TIME             0.96                -1             0

*P < 0.05
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Summary 

The numerical results just presented must be considered in 
the context of the local software development environment. 
The results for each technology are discussed below. 

• Quality Assurance --A program of regular reviews
(e.g., system requirements, preliminary design) improves
software reliability at little or no additional cost in
developers' time. Time spent on reviews is recovered by
avoiding subsequent problems.

• Software Tool Use --Extensive computer use in general
seems to have a negative effect on productivity, although
some specific tools may facilitate specific tasks.
This index is based on the tools available in the flight 
dynamics environment. None of these tools has a demon- 
strable effect on productivity or reliability. 

• Documentation --The development of effective
documentation requires a careful review of the product under
development. Documentation is, to some extent, a prerequi- 
site for quality assurance reviews, and thus has a signifi- 
cant favorable effect on software reliability. 

• Structured Code --The use of structured code pro- 
duced no significant effect on productivity or reliability. 
However, the benefits of this technique are expected to 
occur in maintenance. 

• Top-Down Development --The high-level designs of all
of the systems in the sample studied were similar, and a
substantial amount of code was reused from previous sys- 
tems. Hence, it is not surprising that no benefit was iden- 
tified from the use of top-down development in this 
environment. 
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• Code Reading --The simple practice of code reading 
improves software reliability at little or no additional 
cost in developers' time. 

• Chief Programmer --The use of a chief programmer 
team produced no significant effect on productivity or reli- 
ability. However, it may provide other benefits. 

• Design Time --The percent of schedule spent in de- 
sign showed no significant effect on productivity or reli- 
ability. The high-level designs of all systems studied were 
similar, and the software development problem was well 
understood. In this situation, additional design time may 
not improve the product. 
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CONCLUSIONS 


The analysis results presented in the preceding section lead 
to two types of conclusions: those pertaining to the conduct
of software development in the local (SEL) environment,
and those of a more general nature. For the most part, 
these conclusions are consistent with similar work by other 
researchers and with assumptions commonly accepted in the 
software development community. 

The Local Environment 

The results of this analysis provide the following sugges- 
tions for the conduct of flight dynamics software develop- 
ment projects: 

• Use a small team of appropriately experienced in- 
dividuals 

• Do not depend on the computer to do the pro- 
grammer's thinking 

• Read all code developed 

• Effectively document each phase of development 

• Conduct regular quality assurance reviews 

The most important lessons are that developers must be cap- 
able and must consciously seek quality. These conclusions 
will be fed back into the management of subsequent software 
development projects at Goddard Space Flight Center (GSFC).

General Implications 

The analytic procedure and some results of this study are 
applicable to more than just the GSFC flight dynamics envi- 
ronment. The general conclusions of the study are as 
follows:

• Technology use can be measured and evaluated in a 
production environment. 
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• A model that explains much of the variation in pro- 
ductivity and reliability was developed for tech- 
nology evaluation. 

• Limited use of the technologies studied can produce 
up to about a 30-percent improvement. 

Although the improvements identified in this study were in 
the area of reliability, a corresponding decrease in main- 
tenance cost due to a smaller need for error correction 
should also be realized. Furthermore, productivity appears 
to be a companion of quality software development. In addi- 
tion, some technologies may produce other beneficial effects 
in areas not yet studied by the SEL. 

The analysis of covariance model appears to be one appro- 
priate technique for evaluating the effects of technologies 
in this context. However, small improvements in productiv- 
ity and/or reliability that were not detected by this pro- 
cedure might occur. More such evaluation efforts are needed 
to provide an empirical basis for the formulation of soft- 
ware development standards. 
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Abstr act 


This paper describes research con- 
ducted by the Software Engineering Labora- 
tory (SEL) on the use of dynamic variables 
as a tool to monitor software development. 
The intent of the project is to identify 
project independent measures which may be 
used in a management tool for monitoring 
software development. This study examines 
several FORTRAN projects with similar pro- 
files. The staff was experienced in 
developing these types of projects. The 
projects developed serve similar functions.
Because these projects are similar,
we believe some underlying relationships 
exist that are invariant between the pro- 
jects. These relationships, once well 
defined, may be used to compare the 
development of different projects to 
determine whether they are evolving the 
same way previous projects in this 
environment evolved. 


Overview

The Software Engineering Laboratory 
(SEL) is a joint effort between the
National Aeronautics and Space Administra- 
tion (NASA), the Computer Sciences Cor- 
poration (CSC), and the University of 
Maryland established to study the software 
development process. To this end, data 
has been collected for the last six years. 
The data was from attitude determination 
and control software developed by CSC, in 
FORTRAN, for NASA. Additional information 
on the SEL, the data collection effort, 
and some of the studies that have been 
made may be found in papers from the
Software Engineering Laboratory Series
published by the SEL [1,2,3].

This research was supported by the National
Aeronautics and Space Administration
grant NSG-5123 to the University of
Maryland. Computer support provided in 
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Space Flight Center. 
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The interest in the software develop- 
ment process is motivated by a desire to
predict costs and quality of projects
being planned and developed. For several
years, studies have examined the relationships
between variables such as effort,
size, lines of code, and documentation [4,5].
These studies, for the most part, used
data collected at the end of past projects
to predict the behavior of similar projects
in the future. In 1981 the SEL concluded
that many of these factors were too
dependent on the environment to be useful
for the models that had been developed [6].
Any model which attempts to trace these
relationships should therefore be calibrated
to the environment being examined.
The meta-model proposed by the SEL is
designed for such flexibility [6].

Another way to isolate the
environment-dependent factors is by comparing
two internal factors of a project,
thus ignoring all outside influences. One
approach that is used to monitor software
development examines the time gap between
the initial report of software problems
and the complete resolution of the problem [7].
Comparing two variables is useful
because it also accentuates problem areas
as they develop, providing relative information
rather than absolute information.
Relative information is useful to the project
manager because it accentuates trends
as the project develops. If project
environments are similar, then similar
values should be expected. Because the
project environments in the SEL are similar,
it was felt that this approach could
be further extended to provide managers
with information about how a set of variables
over the course of a project differed
from the same set of variables on
other projects (baselines). The managers 
could be alerted to potential problems and 
use other variable data and project 
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knowledge to determine whether the project 
was in trouble. 

This methodology is flexible enough 
to respond to changing needs. Every time 
a project is completed the measures collected
during its development may be added
in to calculate a new baseline. In this
way the baseline reflects
changes in the environment as they occur.

Baselines might also be developed to 
reflect different attributes. For 
instance, several projects which had good 
productivity might be grouped to form a 
productivity baseline. Once baselines are 
established, projects in progress may be 
compared against them. All measures fal- 
ling outside the predetermined tolerance 
range are interpreted by the manager. 


Methodology 

The implementation of this methodol- 
ogy is dependent on two factors. The 
first factor is the availability of meas- 
ures that are project independent and can 
also be collected throughout a project's 
development. Variables like programmer 
hours and number of computer runs are pro- 
ject dependent. By comparing these vari- 
ables against each other a set of relative 
measures may be generated which is project 
independent. For instance, the number of 
software changes may vary from project to 
project. The project dependent features 
shared by each variable will cancel out 
when the ratio of software changes per 
computer run is taken. The resulting 
relative measure is project independent. 

The second factor is the need for
fixed time intervals common to all pro- 
jects. To normalise for time, project 
milestones were used. The time into a 
project might be twenty percent into cod- 
ing instead of ten weeks into the project, 
for instance. 
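
The two factors above can be sketched as follows: a ratio of two project-dependent variables, and a position expressed as a percentage of a phase rather than as a week number. The function names and the sample figures are hypothetical.

```python
def relative_measure(numerator, denominator):
    """Ratio of two project-dependent variables, e.g. changes per computer run.

    Returns None when the denominator is zero (no events recorded yet).
    """
    if denominator == 0:
        return None
    return numerator / denominator


def percent_into_phase(week, phase_start_week, phase_end_week):
    """Express a calendar week as a percentage of the way through a phase."""
    return 100.0 * (week - phase_start_week) / (phase_end_week - phase_start_week)


# Hypothetical project: 120 software changes over 400 computer runs,
# observed in week 20 of a coding phase that runs from week 10 to week 60.
changes_per_run = relative_measure(120, 400)
position = percent_into_phase(20, 10, 60)
print(changes_per_run, position)
```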

When computing the baselines one 
other factor was considered. At any given 
interval during development a variable may 
measure either the total number of events 
that have occurred from the beginning of 
development (cumulative) or the number of
events that have occurred since the
last measured interval (discrete). Since 
these approaches may convey different 
information it was felt that they both 
should be used. 
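
The relationship between the two views can be sketched as below; the counts are hypothetical, and the helper names are not taken from the SEL tooling.

```python
from itertools import accumulate


def to_cumulative(per_interval):
    """Running total of events from the beginning of development."""
    return list(accumulate(per_interval))


def to_discrete(cumulative):
    """Events that occurred since the previous measured interval."""
    previous = [0] + cumulative[:-1]
    return [cur - prev for prev, cur in zip(previous, cumulative)]


# Hypothetical software changes recorded in four successive intervals.
changes_per_interval = [5, 12, 8, 20]
changes_to_date = to_cumulative(changes_per_interval)
print(changes_to_date)
print(to_discrete(changes_to_date))
```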

For simplicity, the baseline for each 
relative measure was defined as the aver- 
age and standard deviation computed for 
the measure at predetermined intervals. A 
project's progress may now be charted by 
the software manager. At each interval in 
a project's development the relative measures
are compared with their respective
baselines. Any measures more than one standard
deviation from the average are flagged. These measures are
then interpreted by the project manager to 
determine how the project is progressing. 

A flagged measure may indicate a project 
is developing exceptionally well or it may 
indicate a problem has been encountered. 
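
A minimal sketch of building a baseline and flagging a measure follows. The text does not say whether a sample or population standard deviation was used; the sample form is assumed here, and all values are hypothetical.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Map each interval to (average, standard deviation) over past projects."""
    return {interval: (mean(vals), stdev(vals)) for interval, vals in history.items()}


def deviations_from_baseline(value, entry):
    """Signed distance from the baseline average, in standard deviations."""
    average, sd = entry
    return (value - average) / sd


def is_flagged(value, entry, tolerance=1.0):
    """True when the measure lies more than `tolerance` deviations from average."""
    return abs(deviations_from_baseline(value, entry)) > tolerance


# Hypothetical values of one relative measure at one interval on five past projects.
history = {"40% coding": [0.20, 0.30, 0.25, 0.35, 0.40]}
baseline = build_baseline(history)

high = is_flagged(0.45, baseline["40% coding"])
normal = is_flagged(0.33, baseline["40% coding"])
print(high, normal)
```

A flag in either direction is reported the same way; deciding whether it is good or bad news is left to the manager, as the text describes.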

The interpretation of a set of 
flagged measures is a three step process. 
First, the manager must determine the pos- 
sible interpretations for each flagged 
relative measure using lists of possible 
interpretations developed and verified 
based on past projects. 

Second, the union of the lists of 
possible interpretations of each flagged 
measure must be taken. The list formed by 
this union contains all the possible
interpretations ordered using the number 
of times each interpretation is repeated 
in the different lists. The larger the 
number of overlaps a possible interpreta- 
tion has, the greater the probability it 
is the correct interpretation. 

Third, the manager must analyze the 
combined list and determine if a problem 
exists. Interpretations with an equal
number of overlaps all have an equal probability
of being the correct interpretation.
If none of the possible interpretations
for a given relative measure overlap
then the relative measure should be considered
separately.

When analyzing the interpretations,
three pieces of information must be considered:
the measurements, the point in
development, and the manager's knowledge of
the project. A relative measure may indicate
different things depending on the
stage of development. For instance, a
large amount of computer time per computer
run early in the project may indicate not
enough unit testing is being done. Personal
knowledge may also give valuable
insight.

A fundamental assumption for using
this methodology is that similar types of
projects evolve similarly. If a different
type of project were compared to this database,
the manager would have to decide
whether the baselines were applicable.
Depending on the type of differences, the
established baselines may or may not be of
any value.


EXAMPLE 1

Forty percent into coding a software 
manager finds that the lines of source 
code per software change is higher than 
normal. A list previously developed is 
examined to determine what the relative 
measure might indicate. The possible
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interpretations for a large number of 
lines of source code per software change 
might be: 

good code 

easily developed code 
influx of transported code 
near build or milestone date 
computer problems 
poor testing approach 

If this were the only flagged measure the 
manager would then investigate each of the 
possibilities. If the value for the meas- 
ure is close to the norm less concern is 
needed than if the value is further away. 

If in addition to lines of source 
code per software change the number of 
computer runs per software change was 
higher than normal, the manager would also 
examine this measure. The possible 
interpretations for a large number of com- 
puter runs per software change might be: 

good code 
lots of testing 
change backlog 
poor testing approach 

The union of the possible interpretations 
of these two measures indicates that the 
strongest possible interpretations are 1) 
good code and 2) a poor testing approach. 
The number of possibilities to investigate
is smaller because these are the only
interpretations which overlap. The manager must
now examine the testing plan and decide
whether either of these interpretations
reflects what is actually occurring in the
project. If these two possible interpretations
do not reflect what is happening
on the project, the manager would then
examine the other interpretations.
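
The union step in this example can be sketched as follows, using the two interpretation lists given above. The function name and data layout are assumptions for illustration.

```python
from collections import Counter

# Above-normal interpretation lists, as given in Example 1.
ABOVE_NORMAL = {
    "lines of source code per software change": [
        "good code",
        "easily developed code",
        "influx of transported code",
        "near build or milestone date",
        "computer problems",
        "poor testing approach",
    ],
    "computer runs per software change": [
        "good code",
        "lots of testing",
        "change backlog",
        "poor testing approach",
    ],
}


def rank_interpretations(flagged_measures, lists):
    """Union of the lists, ordered by how many flagged measures share each entry."""
    counts = Counter()
    for measure in flagged_measures:
        counts.update(set(lists[measure]))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranked = rank_interpretations(list(ABOVE_NORMAL), ABOVE_NORMAL)
top_count = ranked[0][1]
strongest = [name for name, count in ranked if count == top_count]
print(strongest)
```

Ties are kept together because the text treats interpretations with the same number of overlaps as equally likely.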


Baseline Development

To develop a baseline one must first
have variables whose measurements were 
taken weekly for several projects. Five 
variables in the SEL database were used. 
The lines of source code, number of 
software changes, and number of computer 
runs were collected on the growth history 
form. The amount of computer time and 
programmer hours were collected on the 
resource summary form. Measurement of 
these variables started near the beginning 
of coding. In this study, nine separate 
projects were examined whose development 
was documented, with sufficient data, in 
the SEL database. The projects ranged in 
size from 51K to 112K lines of source code
with an average of 75K. No examination 
was done for the requirements or design 
phases. 

Once the variables were chosen the
average and standard deviation were computed
for each baseline. Some baselines
suffered from limited data points during
the beginning of the coding phase. A couple
of the projects, in which problems
were known to have existed, were flagged
as soon as data on these projects
appeared, but this was fifty percent of
the way into coding. It is not known how
much earlier they would have been flagged if
data had existed at the early intervals.

Interpretation of Relative Measures


Once a set of baselines is established,
new projects may be compared to
them and potential problems flagged. To
interpret these flagged relative measures
a list should be developed with each measure's
possible interpretations. Each list
must consider the possible interpretations
of the relative measure when it is either
above normal or below normal. What each
component variable actually measures
should also be considered when the different
lists are developed.

A list was developed with possible
interpretations for each relative measure
being examined in the context of the SEL
environment. In another environment the
interpretation of these measures might be
different. These lists are subdivided
into two categories: above and below normal.
The above normal category contains
possible interpretations for the relative 
measure when it is outside one standard 
deviation from the average in the positive 
direction. The below normal category 
refers to interpretations when the measure 
is outside one standard deviation from the 
mean in the negative direction. 

One of the reasons this methodology 
works is because of the implicit inter- 
dependencies between different relative 
measures. To show these interdependencies 
more explicitly a cross reference chart 
has also been provided for each interpre- 
tation to indicate other relative measures 
that can have the same interpretation. A 
number in the cross reference section 
indicates the list number of a relative 
measure that can have the same interpreta- 
tion. The position of the list number in 
the 4-quadrant cross reference section 
indicates whether both interpretations are 
found with above normal values, both with 
below normal values, or one with above and
the other with below normal values. 
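
A cross reference of this kind can be derived mechanically from the lists themselves, as sketched below. The List 7 entries are the below normal interpretations quoted later in this paper for programmer hours per computer run; the List 3 entries are hypothetical placeholders, and the data layout is an assumption.

```python
def cross_reference(lists):
    """For each (list number, direction, interpretation), find the other
    (list number, direction) pairs that carry the same interpretation."""
    xref = {}
    for (number, direction), interpretations in lists.items():
        for interpretation in interpretations:
            xref[(number, direction, interpretation)] = sorted(
                (other_number, other_direction)
                for (other_number, other_direction), others in lists.items()
                if other_number != number and interpretation in others
            )
    return xref


lists = {
    # List 7, below normal: taken from the 60% coding discussion.
    (7, "below"): ["error prone code", "lots of testing", "easy errors being fixed"],
    # List 3, above normal: hypothetical contents for illustration only.
    (3, "above"): ["error prone code", "bad specifications"],
}

xref = cross_reference(lists)
print(xref[(7, "below", "error prone code")])
```

The direction stored with each match corresponds to the quadrant position described above: same direction on both lists, or above on one and below on the other.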

With these lists a set of flagged 
relative measures may be evaluated. When 
a relative measure is flagged, its associ- 
ated list is examined for possible 
interpretations. Overlaps of this list 
with the lists of other flagged relative 
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List 1: Co»put«r Runs ,per Line of Source Cods 
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Asia tire Measures Examined: 

List 1 - Computer Runs per Line of Source Code
List 2 - Computer Time per Line of Source Code
List 3 - Software Changes per Line of Source Code
List 4 - Programmer Hours per Line of Source Code
List 5 - Computer Time per Computer Run
List 6 - Software Changes per Computer Run
List 7 - Programmer Hours per Computer Run
List 8 - Computer Time per Software Change
List 9 - Programmer Hours per Software Change

List 2: Computer Time per Line of Source Code
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measures form the new list of what these 
relative measures together might indicate. 
The more overlaps a particular interpreta- 
tion has, the greater the chance it is the 
correct interpretation. Interpretations
with the same number of overlaps must be 
considered equally. The more relative 
measures flagged the more serious the 
problem may be. It is up to the manager 
to determine whether the deviation is good 
or bad. 


Monitoring a Software Project's Development

Once the baselines have been 
developed and the lists of possible 
interpretations have been put together a 
software manager may monitor the actual 
development of a project. Example 1 
demonstrated how a single interval may be 
interpreted. The following discussion 
will trace the development of an actual 
project. During the actual use of this
methodology, influence would be exerted to
correct problems as soon as they are identified.
With this study, we must be content
to study a project's evolution,
without hindrance, and see at what points
problems could have been detected.

Project twenty* was chosen for this
examination because data existed
throughout the project's development. In
most respects project twenty was an average
project. The project did have a lower
than normal productivity rate. The lower
rate may be partially explained by the
fact the management was less experienced
when compared to other projects. The project
also suffered from some delayed
staffing. Changes in staffing will be
noted when the different time intervals
are discussed.

The tables on the following page show
which relative measures were flagged when
project twenty was compared to the base- 
lines for each stage of development. The 
numerical values represent how many stan- 
dard deviations each flagged relative 
measure was from the baseline. The base- 
line for each relative measure was calcu- 
lated using all nine projects. 


Start of Coding:

At the start of coding only one rela- 
tive measure is flagged. The smaller than 
normal number of software changes per line 
of source code using the discrete approach 
reflects work done during the design 
phase. The lists designed in the previous 
section were directed towards code production
and testing and do not apply to this
time interval when using the discrete
approach. This measure may indicate good
specifications or lots of PDL being generated.
The manager might want to examine
this measure later if it is constantly
repeated. Since it is the only measure
flagged at this time it will be ignored.


* The numbering convention used is an
extension of the one first used by Bailey
and Basili [6].
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20 % Coding: 

The flagged relative measures found
using the discrete approach at this point
represent the work done from the start of
coding until twenty percent of the way
through coding. The list of possible
interpretations for the flagged relative
measures, generated from the lists made
previously for the individual relative
measures, would look like:

# overlaps   interpretation

3            bad specifications
3            code removed
2            low productivity
2            high complexity
2            error prone code
1            lots of testing
1            good testing
1            changes hard to isolate
1            changes hard to make
1            unit testing being done
1            easy errors being found

The strongest interpretations are bad
specifications and code being removed. If
the actual history is examined one finds
that during this period there were a lot
of specifications being changed. This
resulted in code which was to be modified
being discarded and new code being written.
During the early period lots of PDL
was being produced but very little new
executable code. The list of possible
interpretations does show that low productivity
is also a strong possibility.


40% Coding:

The flagged relative measures which
appear using the cumulative approach, from
this time period on, are stronger indicators
than the ones used in the first couple
of intervals because the average is
computed using more data points. The use
of the discrete approach for the interval
of twenty to forty percent is still dependent
on three data points. The list of
possible interpretations for this time
period is:

interpretation (ordered by # overlaps)

low productivity
high complexity
error prone code
bad specifications
code being removed
changes hard to isolate
changes hard to make
lots of testing
unit testing being done
good testing
easy errors

The number of possibilities is larger with
this set of possible interpretations.
Five interpretations are slightly stronger
than the others. During the actual
development, the first release of the project
was made. The amount of code actually
written was also lower than normal
during this period. The use of the
discrete approach gives a stronger feeling
that code is not being written. Transported
code tends to be installed in
large blocks which can be isolated using
the discrete approach.


50% Coding:

The relative measures flagged during 
this period are the same as the ones 
flagged at the twenty percent coding 
interval. The deviation from the norm for 
this interval is larger. The larger devi- 
ation may indicate a more serious problem. 
The problem may have been just as serious
earlier but without the extra data points
that are now available, it could not be
determined. The possible interpretations
may be taken from the list developed earlier.
Bad specifications and code removal
were not factors during this period. The
next three highest priority interpretations
were high complexity, error prone
code, and low productivity. In addition
to this the manager should be concerned 
with the continued appearance of the rela- 
tive measure, programmer hours per com- 
puter run, as seen using the cumulative 
approach. This may indicate a lot of 
testing going on. This in conjunction 
with error prone code as a possible 
interpretation may indicate trouble. Dur- 
ing actual development this period was 
spent developing code for the second 
release. The project manager felt that 
code was still not being developed quickly 
enough during this period. 


60% Coding:

Only one relative measure is shown at 
this interval. The number of programmer 
hours per computer run using the cumula- 
tive approach is lower than normal for the 
third consecutive time. This should con- 
cern the manager because when examining 
the list for this measure one finds:


error prone code 

lots of testing 

easy errors being fixed 

Since the occurrence of this measure is 
persistent it may indicate that the prob- 
lem was corrected but not enough effort 
was expended to completely compensate for 
the past problems. It might also indicate 
the problem still exists. During the 
actual project it was found that while a
lot of code was written, it had not been
thoroughly tested. Release two was made
during this period which could explain a 
heavy test load. Two additional staff 
members were added to the project during 
this phase to aid in coding and testing. 


80% Coding:

The eighty percent coding interval 
does not show any measures outside the 
normal bounds. The addition of two staff 
members during the sixty percent coding 
phase, as well as the addition of a senior 
staff member during this phase, appears to 
have adjusted the project back along the 
lines of normal development. To fully 
compensate for the earlier problems one 
might expect some of the measures to swing 
in the other direction away from the aver-
age. The fact that this overcorrection did
not occur might explain the problems
encountered in the next section. 


drop off from this high measure is to be 
expected when using the cumulative 
approach. Possible interpretations that
would apply for this period of development
include:


high complexity 
lots of testing 
unit testing being done 
testing code being removed 


A lot of testing is certainly indicated by 
past history. 


Start Acceptance Testing:

The relative measures flagged at this 
interval reflect the build up in testing
before the start of acceptance testing. 

The list of possible interpretations looks 
like: 

# overlaps interpretation


Start of System and Integration Testing:

The flagged relative measures at this 
time period reflect the build up of effort
for the third and final release. The list 
of possible interpretations for the col- 
lective set of flagged measures looks 
like: 

# overlaps interpretation


3 bad specifications 

3 code being removed 

2 high complexity 

2 low productivity 

\ error prone code 

1 lots of testing 

changes hard to isolate 
changes hard to make 
unit testing being done 
good testing 


3 high complexity 

3 bad specifications 

3 code being removed 

2 error prone code 

2 low productivity 

2 lots of testing 

V changes hard to isolate 

V unit testing being done 

1 good code 

1 poor testing 

changes hard to make 

good testing 

compute bound algorithms 
being run 

easy errors being fixed 


Since the code did have a past history of 
poor testing an unusually large build up 
of testing should be expected. The two 
interpretations that apply most to this 
situation are lots of testing and error 
prone code.


Since little code was being developed dur- 
ing the testing period, a large amount of 
testing with errors being found is the
most reasonable interpretation of these 
flagged measures. The early history of 
poor testing may be seen here with errors 
being uncovered late. 


End Acceptance Testing:

The two flagged relative measures at 
the end of acceptance testing reflect the 
clean up effort being made on the code. 

An average amount of computer time and an 
average number of computer runs indicates 
that the acceptance testing is going well.
The project was behind schedule due to the 
earlier problems encountered. Clean up 
was done during the acceptance testing 
phase in an attempt to get the project out 
the door as soon as possible. 


50% System and Integration Testing:

Only one relative measure is flagged 
at this interval. This measure was 
flagged using the cumulative approach. An 
examination of the measure at the previous 
interval shows a very high value. A slow 


As seen in this example, the problems
that occur during a project's development
are reflected in the values calculated for
the relative measures. The methodology
proposed can be used to monitor projects.
The number of possible interpretations 
increases with each new flagged relative 
measure. The ordering of the measures by 
the number of overlaps provides an easy
method of sorting the possible interpreta- 
tions by priority. Another method of 
sorting the possible interpretations could 
include a factor that considers both the 
number of overlaps and the probability of 
a given interpretation being the cause at 
a given interval. The weighting of 
interpretations for a given interval could 
be calculated using the pattern of 
occurrence of the different interpreta- 
tions which have appeared during the same 
interval in past projects. 
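
The two orderings described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_interpretations`, the `prior` table, and the rule of multiplying the overlap count by the prior are hypothetical choices, not something the paper specifies.

```python
from collections import Counter


def rank_interpretations(flagged_lists, prior=None):
    """Rank interpretations by overlap count, optionally weighted by a prior.

    flagged_lists: one list of candidate interpretations per flagged measure.
    prior: hypothetical mapping from interpretation to the probability that it
           was the cause at this interval in past projects. Interpretations
           missing from the mapping keep a weight of 1.0 (an assumption).
    Returns (interpretation, overlaps, score) tuples, highest score first.
    """
    overlaps = Counter()
    for interpretations in flagged_lists:
        # Count each interpretation once per flagged measure.
        overlaps.update(set(interpretations))

    prior = prior or {}
    scored = [
        (name, count, count * prior.get(name, 1.0))
        for name, count in overlaps.items()
    ]
    # Highest score first; ties are broken alphabetically for a stable order.
    scored.sort(key=lambda row: (-row[2], row[0]))
    return scored
```

With no `prior` the result is the plain overlap ordering used in the example project. Supplying a prior lets an interpretation with fewer overlaps outrank one that history shows is rarely the cause at that interval.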

An Alternate Approach

Flagged relative measures might also 
be interpreted using a decision support 
system. The data for the various relative 
measures would be stored in a knowledge 
base along with a set of production rules. 
To evaluate a project the values for each 
relative measure would be entered into the 
system. The knowledge base would compare 
the relative measures to their respective
baselines, determine which relative meas-
ures were outside the norm, and interpret
these relative measures using the produc- 
tion rules. A list of possible interpre- 
tations ordered by probability would be 
generated as a result. 

The difference between a decision 
support system and the approach presented 
in this paper is the method of interpret-
ing the flagged relative measures. Each
production rule in the decision support 
system is the logical disjunction of 
several flagged measures which yields a 
given interpretation. Each production 
rule is assigned a confidence rating which 
is then used to rate the possible 
interpretations. The lists for the rela- 
tive measures provided earlier in the 
paper may be easily converted to produc- 
tion rules using the cross reference sec- 
tion. To develop the production rules for 
an interpretation one must generate the 
various combinations of relative measures 
which might reasonably imply the interpre- 
tation. Some relative measures may not 
imply a particular interpretation unless 
they are found in conjunction with another 
relative measure. Once the production 
rules are known and a knowledge base con- 
structed a decision support system may be 
built. For an example of a domain-independent
decision support system see Reggia and
Perricone [8].
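
A minimal sketch of how such production rules might be represented follows. Everything in it is an assumption made for illustration: the `Rule` structure, the confidence values, and the choice to keep the highest confidence when several rules yield the same interpretation. Each rule here fires only when all of its conditions are flagged, and several rules for one interpretation act as alternatives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    conditions: frozenset   # (measure, direction) pairs that must all be flagged
    interpretation: str
    confidence: float       # hypothetical confidence rating in [0, 1]


def diagnose(flagged, rules):
    """Return (interpretation, confidence) pairs, most confident first."""
    best = {}
    for rule in rules:
        # A rule fires when every one of its conditions is among the flags.
        if rule.conditions <= flagged:
            current = best.get(rule.interpretation, 0.0)
            best[rule.interpretation] = max(current, rule.confidence)
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rules; the confidence ratings are invented for the example.
RULES = [
    Rule(frozenset({("programmer hours/computer run", "below")}),
         "lots of testing", 0.5),
    Rule(frozenset({("programmer hours/computer run", "below"),
                    ("computer runs/line of source code", "above")}),
         "lots of testing", 0.8),
    Rule(frozenset({("programmer hours/line of source code", "above")}),
         "high complexity", 0.4),
]
```

The second rule shows a measure that strengthens an interpretation only when it is found in conjunction with another flagged measure.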


Summary

The methodology presented in this 
paper showed that invariant relationships
exist for similar projects. New projects 
may be compared to the baselines of these 


invariant relationships to determine when 
projects are getting off track. 

The ability of the manager to inter- 
pret the measures that fall outside the 
norm is dependent on the amount of infor-
mation the underlying variables convey.

The manager must decide what attributes 
are to be measured (e.g. productivity) and 
pick variables that are closely related to 
them and are also measurable throughout 
the project. As an example, a variable 
like lines of code may be too general when 
measuring productivity. Measuring the 
newly developed code, either source code 
or executable code, would be more informa- 
tive since these variables are more 
directly related to effort. How applica-
ble an interpretation is for the period
currently being examined should also be 
considered when ordering the list. The 
variables the manager finally decides on 
are then combined to form relative meas- 
ures. 

One method of interpreting a relative 
measure is by associating lists of possi- 
ble interpretations with it. When a rela- 
tive measure appears outside the norm, the 
list of possible interpretations is con- 
sidered. If more than one relative meas- 
ure is outside the norm the lists are com- 
bined. The more times a possible 
interpretation is repeated in the lists, 
the greater the probability it is the 
cause. How applicable an interpretation
is for the period being examined should 
also be considered when ordering the list. 
The manager must investigate the suggested
causes to determine the real one.


Conclusion 

The ability to monitor a project's
development and detect problems as they 
develop may be feasible. The methodology 
proposed showed favorable results when 
examining a past case. 

The use of baselines and lists of 
interpretations for comparing projects 
provides an easy method for monitoring 
software development. Both the baselines 
and the lists of interpretations may be 
updated as new projects are developed. As 
more knowledge is gleaned the accuracy of 
this system should improve and provide a 
valuable tool for the manager. 
OVERVIEW


• A GENERAL METHODOLOGY TO MONITOR 
SOFTWARE DEVELOPMENT TO DETECT 
PROBLEMS EARLY 

• THE METHODOLOGY MUST: 

REQUIRE MINIMAL OVERHEAD FOR DATA 
COLLECTION 

PROVIDE AN EASY WAY TO INTERPRET DATA 
BE ADAPTABLE TO CHANGING CONDITIONS 
METHODOLOGY


• DEVELOP A SET OF GOOD PREDICTORS FOR 
THE DEVELOPMENT ENVIRONMENT 

• NORMALIZE THE MEASURES TO DEVELOP 
BASELINES BASED UPON PAST PROJECTS 

• COMPARE A DEVELOPING PROJECT TO 
KNOWN BASELINES TO DETERMINE 
DIFFERENCES FROM KNOWN BASELINES 

• INTERPRET THE DATA TO EVALUATE THIS 
DEVIATION 

• IF THERE IS A PROBLEM, DETERMINE HOW 
TO CORRECT IT 


APPROACH 

PERFORM A PILOT STUDY 
TRIAL METRICS, BASELINES 
EVALUATE FEASIBILITY 
(DONE: CARL DOERFLINGER) 

BUILD KNOWLEDGE-BASED SYSTEM 

USING PILOT STUDY METRICS 

IMPROVING INTERPRETATION AND 
KNOWLEDGE MECHANISM 

(JUST STARTED: CONNIE RAMSEY) 

INVESTIGATE OTHER METRICS 
ERRORS 

ERROR CATEGORIES 

(IN PROGRESS: DEBA PATNAIK) 


MEASUREMENT POINTS (Pi) 

COMMON ACROSS DATA BASE OF PROJECTS 
NORMALIZED OVER TIME 
REASONABLE TO MEASURE 

PILOT STUDY MEASUREMENT POINTS: 

START DESIGN 
50% DESIGN 
START OF CODING 
20% CODING 
40% CODING 
50% CODING 
60% CODING 
80% CODING 

START OF SYSTEM & INTEGRATION TEST 
50% SYSTEM & INTEGRATION TEST 
START ACCEPTANCE TEST 
END ACCEPTANCE TEST 


MEASURES (Mi) 

AVAILABLE ACROSS MOST OF PROJECT 
INVARIANT TO SIZE, CALENDAR TIME, ETC. 
AVAILABLE ON SEVERAL PRIOR PROJECTS 
EASY TO COLLECT 

DATA AVAILABLE IN SEL: 

COMPUTER TIME 
COMPUTER RUNS 
PROGRAMMER HOURS 
LINES OF SOURCE CODE 
SOFTWARE CHANGES 

TRIAL METRICS FOR PILOT:

COMPUTER RUNS/LINE OF SOURCE CODE 
COMPUTER TIME/LINE OF SOURCE CODE 
SOFTWARE CHANGES/LINE OF SOURCE CODE 
PROGRAMMER HOURS/LINE OF SOURCE CODE 
COMPUTER TIME/COMPUTER RUN 
SOFTWARE CHANGES/COMPUTER RUN 
PROGRAMMER HOURS/COMPUTER RUN 
COMPUTER TIME/SOFTWARE CHANGE 
PROGRAMMER HOURS/SOFTWARE CHANGE 
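
The nine trial metrics above are ratios of the five raw variables collected in the SEL. A short Python sketch of that computation is given below; the dictionary keys and the function name `relative_measures` are illustrative assumptions.

```python
# Numerator/denominator pairs for the nine trial metrics listed above.
TRIAL_METRICS = [
    ("computer_runs", "lines_of_source_code"),
    ("computer_time", "lines_of_source_code"),
    ("software_changes", "lines_of_source_code"),
    ("programmer_hours", "lines_of_source_code"),
    ("computer_time", "computer_runs"),
    ("software_changes", "computer_runs"),
    ("programmer_hours", "computer_runs"),
    ("computer_time", "software_changes"),
    ("programmer_hours", "software_changes"),
]


def relative_measures(totals):
    """Compute each trial metric from the raw totals at one measurement point.

    totals: mapping from raw variable name to its value at that point.
    A metric whose denominator is zero is reported as None.
    """
    measures = {}
    for numerator, denominator in TRIAL_METRICS:
        name = f"{numerator}/{denominator}"
        if totals[denominator]:
            measures[name] = totals[numerator] / totals[denominator]
        else:
            measures[name] = None
    return measures
```

Because every metric is a ratio, the values can be compared across projects of different size, which is the invariance property asked of the measures.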


BASELINES/DEVIATIONS 


ASSUMPTIONS: 

METRICS HAVE SIMILAR BEHAVIOR AT EACH 
POINT 

METRICS DO NOT VARY TOO MUCH OR TOO 
LITTLE AT Pi 

PROJECT ENVIRONMENTS ARE SIMILAR 

DEVIATION FROM NORM IMPLIES SOMETHING 
INTERESTING 

PILOT STUDY: 

DATA: 9 PROJECTS IN BASELINE 

BASELINES: METRIC AVERAGE AT Pi 

CUMULATIVE 

DISCRETE 

DEVIATION: MORE THAN ONE STANDARD 

DEVIATION FROM THE NORM 
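
As a rough sketch, the baseline and the deviation test could be computed as follows. The use of the sample standard deviation, the function names, and the reading of "discrete" as per-interval amounts obtained by differencing the running (cumulative) totals are assumptions; the slides do not spell these details out.

```python
from statistics import mean, stdev


def baseline(values):
    """Average and sample standard deviation of one metric at one point Pi."""
    return mean(values), stdev(values)


def flag(value, values, bound=1.0):
    """Return 'above', 'below', or None for a new project's metric value.

    values: the same metric at the same point for the baseline projects.
    bound:  number of standard deviations allowed before flagging.
    """
    average, deviation = baseline(values)
    if deviation == 0:
        return None
    distance = (value - average) / deviation
    if distance > bound:
        return "above"
    if distance < -bound:
        return "below"
    return None


def discrete_from_cumulative(cumulative):
    """Convert running totals at successive points into per-interval amounts."""
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
```

With nine baseline projects, as in the pilot study, the standard deviation is estimated from nine values per metric and point, so a flag is an invitation to investigate, not a verdict.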






INTERPRETATION 


SET OF MEANINGS FOR EACH Mi AT EACH Pi 
FOR DEVIATION ABOVE THE NORM 
FOR DEVIATION BELOW THE NORM 

SET OF MEANINGS AT Pi COMBINED 

MOST LIKELY INTERPRETATION DERIVED FROM SET OF 
MEANINGS 

MANAGER'S PERSONAL KNOWLEDGE ELIMINATES SOME
INTERPRETATIONS 

PILOT STUDY: 

MEANINGS ASSOCIATED WITH Mi AT Pi GIVEN BY MANAGERS 

VALUE OUTSIDE STANDARD DEVIATION GENERATES 
MEANING SET 

RANKING BASED ON NUMBER OF TIMES EACH MEANING 
APPEARS 

MEANING + RANKING + PERSONAL KNOWLEDGE = 

INTERPRETATION 


V. Basili 


PROGRAMMER HOURS PER LINE OF SOURCE CODE

TYPE                                   CROSS REFERENCE
  INTERPRETATION                       (ABOVE NORMAL, BELOW NORMAL)

ABOVE NORMAL
  - HIGH COMPLEXITY                    1     2 7 8 9
  - ERROR PRONE CODE                   3     5 6
  - BAD SPECIFICATIONS                 1     2 3
  - CODE BEING REMOVED                 1     2 3
    (TESTING OR TRANSPORTED)
  - CHANGES HARD TO ISOLATE            7     8 9
  - CHANGES HARD TO MAKE               7 9
  - LOW PRODUCTIVITY                   1     2

BELOW NORMAL
  - INFLUX OF TRANSPORTED CODE         1 2 3
  - NEAR BUILD OR MILESTONE DATE       6 1 2 3 8 9
  - LOW COMPLEXITY                     3
SAMPLE MEANINGS FOR PILOT ON
TENTH PROJECT 


AT 80% CODE: 

TWO METRICS ABOVE NORM, ONE METRIC BELOW NORM 
ABOVE NORM: 

1. NUMBER OF COMPUTER RUNS/LINES OF SOURCE 
(S.D. = 1.6) 

2. NUMBER OF PROGRAMMER HOURS/LINES OF SOURCE 
(S.D. = 1.3) 


BELOW NORM: 

3. NUMBER OF PROGRAMMER HOURS/COMPUTER RUN 
(S.D. = 1.5) 


# OF OCCURRENCES   MEANINGS               CONTRIBUTORS

2                  HIGH COMPLEXITY        1,2
2                  REMOVAL OF CODE        1,2
2                  LOTS OF TESTING        1,3
1                  LOW PRODUCTIVITY       1
1                  BAD SPECIFICATIONS     2
1                  CHANGES HARD TO MAKE   2
1                  EASY ERRORS FIXED      3

PERSONAL KNOWLEDGE: NO CODE REMOVED 

STANDARD AMOUNT OF TESTING 
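
The tally on this slide can be reproduced with a short Python sketch. The per-metric meaning lists are read off the CONTRIBUTORS column above; the function name `tally` and the way personal knowledge is applied (dropping REMOVAL OF CODE and LOTS OF TESTING) are assumptions made for illustration.

```python
from collections import Counter

# Meanings contributed by each flagged metric, taken from the table above.
MEANINGS = {
    1: ["HIGH COMPLEXITY", "REMOVAL OF CODE", "LOTS OF TESTING",
        "LOW PRODUCTIVITY"],
    2: ["HIGH COMPLEXITY", "REMOVAL OF CODE", "BAD SPECIFICATIONS",
        "CHANGES HARD TO MAKE"],
    3: ["LOTS OF TESTING", "EASY ERRORS FIXED"],
}


def tally(meanings, ruled_out=()):
    """Count how often each meaning occurs and which metrics contribute it.

    ruled_out: meanings the manager eliminates from personal knowledge.
    Returns (occurrences, meaning, contributors) tuples, most frequent first.
    """
    counts = Counter()
    contributors = {}
    for metric, names in meanings.items():
        for name in names:
            if name in ruled_out:
                continue
            counts[name] += 1
            contributors.setdefault(name, []).append(metric)
    # sorted() is stable, so ties keep their order of first appearance.
    ranked = sorted(counts, key=lambda name: -counts[name])
    return [(counts[name], name, contributors[name]) for name in ranked]
```

Calling `tally(MEANINGS, ruled_out={"REMOVAL OF CODE", "LOTS OF TESTING"})` leaves HIGH COMPLEXITY as the only meaning with two occurrences.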


PILOT STUDY CONCLUSIONS 


METHOD VIABLE 

WORKED FOR ONE PROJECT STUDIED IN 
DEPTH 

MEASURES WERE EASY TO GATHER 

ADAPTABLE TO CHANGING ENVIRONMENT 
AND KNOWLEDGE 

AUTOMATABLE 

NEXT STEPS: 

ADD OTHER METRIC 
KNOWLEDGE BASED SYSTEM 
OTHER METRICS UNDER STUDY


• METRICS: ERRORS AND ERROR CLASSES

• MEASUREMENT POINTS: SAME
  NUMBER OF TEST RUNS TO DATE

• BASELINES: SAME
  CUMULATIVE AND DISCRETE






[Figure: ERROR plotted against TIME IN DAYS (axis from 300 to 650 days)]
CHANGES DUE TO ERROR BY CAUSE

[Figure: ERROR SUM plotted against RUNS. Legend (ERRORTYP): CLERICAL ERROR,
DESIGN ERR OF 1, FUNCT SPECS INCO, OTHER, COMPS DESCR INCO, ERR IN LANGUAGE,
MISUNDERSTAND EX, REQ. INCORRECT]
CHANGES DUE TO ERROR BY CAUSE

[Figure: CUMULATIVE FREQUENCY plotted against RUNS. Legend (ERRORTYP):
CLERICAL ERROR, DESIGN ERR OF 1, FUNCT SPECS INCO, OTHER, COMPS DESCR INCO,
ERR IN LANGUAGE, MISUNDERSTAND EX, REQ. INCORRECT]
NEXT STEP


• WE ARE GOING TO BUILD A KNOWLEDGE-BASED
  SYSTEM

• HOW WILL THIS SYSTEM BE USED? 

A TOOL FOR MANAGEMENT 

— WILL INDICATE WHETHER A CURRENT 
PROJECT IS ON SCHEDULE 

— AUTOMATED 

— CAN BE UPDATED EASILY TO INCLUDE 
INFORMATION FROM NEW PROJECTS AND 
NEW INTERPRETATIONS AS MORE IS 
LEARNED 

— MANAGER MUST USE HIS OWN 
KNOWLEDGE OF THE PROJECT WHEN 
LOOKING AT THE RESULTS 


BUILDING A KNOWLEDGE-BASED SYSTEM: 


- USE KMS-A GENERAL SYSTEM USED FOR BUILDING 
KNOWLEDGE-BASED TOOLS (AVAILABLE AT UNIVERSITY 
OF MARYLAND) 

— THERE ARE TWO DIFFERENT APPROACHES: 

- PRODUCTION RULES 

- HYPOTHESIZE AND TEST 

WE WILL TRY BOTH AND COMPARE 
METHOD 

1. BUILD RULES FOR KMS 

2. INPUT DATA FROM MANY SIMILAR PROJECTS IN SAME 
ENVIRONMENT 

3. GIVEN NEW PROJECT, CAN COMPARE CERTAIN METRICS 
TO THOSE IN THE SYSTEM IN AUTOMATED MANNER. 
KNOWLEDGE-BASE INDICATES ABNORMALITIES. 


4. UPDATE 
POTENTIAL SCENARIO
BETWEEN MANAGER AND SYSTEM 


KB = KNOWLEDGE-BASED SYSTEM 
M = MANAGER 


KB: READY FOR COMMAND

M:  OBTAIN DIAGNOSIS

KB: STAGE:

    (1) START CODING
    (2) 20% CODING
    (3) 40% CODING
    (4) 50% CODING
    (5) 60% CODING
    (6) 80% CODING
    (7) START SYSTEM TESTING
    (8) 50% SYSTEM TESTING
    (9) START ACCEPTANCE TESTING
    (10) END ACCEPTANCE TESTING

M:  8

KB: GOODNESS OF TESTING:

    (1) GOOD
    (2) FAIR
    (3) POOR

M:  3

KB: DIAGNOSIS:

    POOR TESTING PROGRAM       <0.60>
    GOOD CODE                  <0.05>
    CHANGES HARD TO ISOLATE    <0.25>
    CHANGES HARD TO MAKE       <0.10>
SUMMARY


• CHOOSE MEASUREMENT POINTS (Pi) 

• CHOOSE A SET OF NORMALIZED INVARIANT 
MEASURES (Mi) 

• DEVELOP A SET OF BASELINES FOR EACH Mi 
AT EACH Pi 

• CHOOSE BOUNDS ON DEVIATIONS FROM THE 
BASELINES 

• ASSOCIATE POSSIBLE MEANINGS FOR 
DEVIATIONS (+ AND -) FROM THE 
BASELINES FOR EACH Mi AT EACH Pi 

• DEVELOP A MECHANISM FOR DERIVING 
INTERPRETATIONS 

• INCORPORATE PERSONAL KNOWLEDGE OF 
PROJECT 

• GENERATE MOST LIKELY INTERPRETATION(S)
CHARACTERISTICS OF A PROTOTYPING EXPERIMENT




JUDIN SUKRI AND MARVIN V. ZELKOWITZ 
DEPARTMENT OF COMPUTER SCIENCE 
UNIVERSITY OF MARYLAND 
COLLEGE PARK, MARYLAND 20742

INTRODUCTION 

In 1982, NASA Goddard Space Flight Center began a project to prototype a new proposed 
software system. Since the system, the Flight Dynamics Analysis System (FDAS), was to be a 
source code control system, and not the more typical flight dynamics software which NASA per- 
sonnel were more familiar with, the decision was made to prototype an initial implementation in 
order to gain insights into the actual features needed to build a full FDAS and to evaluate the 
idea of a prototype in the NASA environment. This report describes the status of that project at 
the end of 1983. 


PROTOTYPING 

In developing the prototype for NASA we need to understand what a prototype is. More 
importantly, for NASA, the issue of prototyping must answer the following questions: 

(1) What are the goals of a prototype? Is it to develop the requirements for a product? Evaluate 
its performance? Predict its final costs? 

(2) What are the issues involved? How does one design for a prototype? Does the software 
lifecycle change? Do we want multiple prototypes for different phases of the life cycle? How 
do we use a prototype when built? 

(3) What tools can be used to design a prototype? to build a prototype? to evaluate a proto- 
type? 

(4) How does one measure a prototype? How do you know if your prototype was successful? 
Should you invest the cost and build the full system or abandon the project? What 
SHOULD a prototype cost? 10% of the final product or 50% or even 100%? 

FLIGHT DYNAMICS ANALYSIS SYSTEM (FDAS) 

The Flight Dynamics Analysis System (FDAS) is being built to help experimenters try alter-
native flight dynamics models. Currently, if an experiment is to be run (e.g., try a new orbit cal-
culation model), the experimenter must access the Fortran source library, know which module to
modify, make the changes, test the changes, recreate a new load module, and then run the experi- 
ment. The experimenter must have detailed knowledge of the software. 

With FDAS, the experimenter enters the system, interacts with a data base, and directs the
system to modify the correct module; the system then aids in the change. Thus changes to software
are easier and require less time and less expertise about the internals.

FDAS consists of two major components - a source code control system to manage the 
libraries of software modules needed for each application program, and a form of data abstraction 
allowing applications programmers the ability to write programs using flight dynamics data types 
(e.g., state, cartesian coordinates, vector locations, etc.). These features are somewhat indepen- 
dent and can be evaluated separately. 

In order to manage source code, the applications programmer enters a tree chart of modules 
(the program’s structure). Usually this will be a full system developed by someone else. The 
applications programmer can then tell the system to edit specific modules and to replace other 
modules by new ones. The system maintains the current set of modules for the system, and keeps 
track of which modules have been altered and which ones need to be compiled. In some ways, this 
model is very much like a combination of both the Source Code Control System (SCCS) and the
MAKE processors running under UNIX systems. 

In order to aid the applications programmer, a form of data abstraction has been proposed. 

A set of standard types has been defined. A programmer may code using these types, and a
preprocessor converts this code into standard Fortran. A generalized input-output structure has
been defined for data of this type. The programmer may write (PUTOUTPUT) the name and
value of any datum from one module and read (GETINPUT) the name and value of that datum
in another module. An initial design decision was to restrict abstract data to their own statements
and not mix them with the Fortran statements. 

In order to build the prototype, the following general strategy is being used: 

(1) A subset of the requirements for FDAS was written and a prototype built to those
requirements.

(2) Data was collected automatically by the FDAS prototype on user interaction with the 
system. 

(3) The usual Software Engineering Laboratory data on programmer activities were collected 
during the development phase. 

(4) The prototype will be evaluated by four groups representing four different views of the 
system. A group of applications programmers (the "users”) will use FDAS and report on its useful- 
ness in solving their flight dynamics problems, a group from the Software Engineering Laboratory 
will evaluate the FDAS model as an appropriate one for solving flight dynamics problems, a 
research group is looking at FDAS as an example of a source code control system, and the 
developers are evaluating the implementation itself, and issues such as efficiency, size, and exten-
sibility to a full system.

(5) Beginning in the early spring of 1984, a new task will begin to design the "full” FDAS 
system. The experiences in the prototype will undoubtedly be helpful in designing and building 
the full system, but there is no commitment to using either the design or the source code of the
prototype. 

(6) After the full system is built, it will be compared with the initial effort. The 
effectiveness of the prototype on the final product will be evaluated. Was FDAS cheaper to build?
Will it be more reliable? Will it be more efficient? Will it have a better man/machine interface? 

INITIAL EVALUATION

The initial requirements for FDAS began early in 1982. The requirements and initial design 
for the prototype were done in the fall of 1982 and the initial implementation of the prototype 
began in January, 1983. As with many software projects, the task was bigger than expected, so an
initial prototype was tested in July of 1983, but the "full" prototype was not available until
October. The evaluation phase is to last until late February, 1984.

Although it is a prototype, it is not a small system. There are 34K lines of Fortran source 
code running under VMS on a VAX 11/780 computer. Of the 34,000 lines (including comments), 
there are 20,200 lines of executable Fortran source statements. The prototype was installed with 
only one small applications system of 3,000 lines for experimentation. This size of 34K is already 
within the size range of other larger "full” systems built by NASA. 

Some of the data collected can be summarized by the following table. In addition to FDAS,
there is data from 11 previous projects monitored by the Software Engineering Laboratory and
data from two other projects now under development.


Phase     11 Proj   Cont 1   Cont 2   FDAS

Design    22%       31%      31%      39%
Code      48%       43%      69%      61%
Test      30%       26%*     0%*      0%*

* - Data still being collected

As can be seen, historically, coding is over twice the design effort. That is also true with one 
of the current projects and is almost true with the other contemporary project. But it is most 
definitely not true with FDAS. This reflects the high design costs since it had "never been done
before." It also reflects the relatively low priority given to full debugging and testing, up to NASA
standards, of the resulting code. Since the prototype has a limited lifetime, "hard” problems were 
deleted from the prototype requirements, and users had to live with annoying but non-critical 
bugs. (Note: At the time that this was written, the full data from testing FDAS was not yet 
entered into the data base, so full testing data is not yet available.) 
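
The comparison of coding effort to design effort can be checked with a little arithmetic. The sketch below uses the design and code percentages reported in this paper and its accompanying slides for three of the columns; the names `EFFORT_PERCENT` and `code_to_design` are illustrative.

```python
# Design and code effort as a percentage of total effort, from the
# effort tables in this paper and its slides.
EFFORT_PERCENT = {
    "11 Proj": {"design": 22, "code": 48},
    "Cont 1": {"design": 31, "code": 43},
    "FDAS": {"design": 39, "code": 61},
}


def code_to_design(effort):
    """Ratio of coding effort to design effort, rounded to one decimal."""
    return round(effort["code"] / effort["design"], 1)


ratios = {name: code_to_design(e) for name, e in EFFORT_PERCENT.items()}
```

The historical ratio comes out above two, while the FDAS ratio is well below it, which is the contrast the paragraph above draws.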

The time spent in design can be summarized as follows:


Hours     11 Proj   Cont 1   Cont 2   FDAS

Design    21709     5885     10758    4508
Total     100324    19085    34461    10477*

* - Still being collected

As can be seen, the 10,477 hours represents a sizeable effort, and is beyond the "toy" proto-
type stage.

Just using the system has shown some other useful aspects of the system. One critical com-
mand, the DEFINE command, has been particularly hard to use, so it will need a better definition
and documentation in the full system. The overhead imposed by FDAS also seems tolerable. For 
example, with compilation times of 10 seconds standard, a preprocessor overhead of 2 seconds is 
tolerable. In addition, since the linkage time for the application system is 18 seconds, the 3 second 
FDAS overhead on top of this is also small. However, the use of the preprocessor seems unduly 
inflexible and should be revised for the full system.

A final complexity in this evaluation is the always changing requirements. When originally 
conceived, FDAS would be an experimental system used on a VAX 11/780. However, in the two 
years since the idea was proposed, the operational groups at NASA have become interested in the system,
and would like such a tool on their operational computer - an IBM 4341. Thus part of the evalua- 
tion (new requirements?) is to consider a 4341 implementation, or an implementation that can 
easily be transported to both systems. While this will undoubtedly make a comparison between 
the full system and the prototype harder to do, since the operational environments (and hence the 
projects’ requirements) are different, it is certainly to NASA’s advantage to have built the proto- 
type so that all groups can view it before a final decision was made to build it in one particular 
environment. 

SUMMARY 

The evaluation phase is still going on, so it is not possible to give a full evaluation. How- 
ever, some results are now apparent. 

(1) The source code control aspects of FDAS are useable, and can be developed into a good 
operational system. 

(2) The data abstraction language and preprocessor need to be rethought and the features 
need to be generalized. 

(3) The prototype and the underlying application are both written in Fortran. There is no 
need for that to be so. It should be possible to monitor any source code application package 
regardless of the language in which FDAS is written. 

(4) The use of the prototype has uncovered many minor and major defects in the design of
such a flight dynamics analysis system. Some original assumptions made during the design phase
turned out not to be true under actual usage conditions. 
Because of these experiences, many defects in FDAS have been discovered before a full sys-
tem is built. From the data collected so far, it appears as if FDAS will be a large system when
built. The development of the prototype should aid NASA in avoiding costly mistakes later.
PROTOTYPING IS OF CURRENT INTEREST
But is it: 

Quick and dirty throw-away? 

Subset implementation? 

Release 1 of full system? 
DO YOU MODEL:

Input-output behavior? 
USES OF A PROTOTYPE:

Feasibility of full system
User interface 
Performance 
Costs 
RESEARCH ISSUES:




How to measure a prototype? 
What are profiles of a prototype 
(baselines)? 

How to evaluate a prototype? 
PROTOTYPING MODELS


Prototype is cheap, system expensive 
Prototype is expensive, system cheap 
Both expensive, but better system 
(more reliable, better user interface) 
NASA/GSFC FDAS PROTOTYPE
(FLIGHT DYNAMICS ANALYSIS SYSTEM) 

Now: 

Access Fortran library 
Modify subroutines 
Recompile and link 
Run experiment 

===> Need details of implementation 

FDAS: 

Access FDAS 

FDAS accesses Fortran code 
Modifications easier 

====> Modifications require less time and effort 

M. Zelkowitz 
FACTORS IN SOFTWARE DEVELOPMENT

FACTOR             Usual Project   FDAS

Requirements       Known           ?
Size               Known           ?
Execution          Known           ?
Algorithm design   Known           ?
User interface     Known           ?
Cost               Known           ?
GOALS OF FDAS:


Decrease experimental setup time 
Solve more problems than is possible today 
Lower required knowledge of system 
Ease of use of experimental system 
Lower software costs to add to FDAS 







FEATURES: 


Source code control 


Data abstractions 


(e.g., state, cartesian) 


Generalized input-output 




SCHEDULE 




Requirements - Summer-Fall, 1982 
Implementation - January-June, 1983 


ACTUAL SCHEDULE: 


Requirements - Summer-Fall, 1982 
Release 1 - January-July, 1983
Release 2 - July-October, 1983 
Evaluation - October-December, 1983 
SIZE OF FDAS


Source code - 34K 


Executable Fortran statements - 20.2K 


Application area - 3K 
EFFORT BY MILESTONES


Phase         11 Projects   Pred.   Cont 1   Cont 2   FDAS

Design        22%           17%     31%      31%      39%
Code          48%           36%     43%      69%      61%
Test          30%           47%     26%*     0%*      0%*

Code/Design   2.2           2.1     1.4      2.2      1.6

Hours

Design        21709         2045    5885     10758    4508
Total         100324        11835   19085    34461    10477


* - Data still being collected
EVALUATORS

NASA/GSFC - FDAS for flight dynamics 
CSC SEL - Use of data types 
UNIV. OF MD - FDAS as source code support 
Developers - Evaluate FDAS capabilities 
EVALUATION CRITERIA


Usable - How easy to set up 


Flexible - Can user alter code easily 


Adaptable - Can FDAS be altered 


Consistent - Can it be used across applications 


Reliable - Can new applications be added 


Stable - Does it fail


Speed - How fast does it execute 
SOME SUBJECTIVE COMMENTS:

As expected, some hard decisions delayed 

Addition of release 1 to schedule 
Some features dropped 

Reliability not up to usual standards 

But system is not an operational one 

Floating requirements 

Full system on VAX or 4341s? 
ADDITIONAL COMMENTS


Some commands redefined 

(DEFINE not well understood) 


Cost of system minimal compared to system overhead 
(preprocessor-2 sec. compiler- 10 sec.) 

(build time-3 sec. link time-18 sec.) 
SUMMARY

Still need to complete evaluation - 

More data to collect 
Need to evaluate error data 

Prototype profile reflects quick development 

Problems in user interface discovered early 
PANEL #2


TESTING PROCEDURE


J. Ramsey, University of Maryland 
A. Goel, Syracuse University 
C. Savolaine, Bell Labs 
Structural Coverage of
Functional Testing. 


James Ramsey 

University of Maryland 
at College Park. 


Abstract 

A FORTRAN program has been instrumented to produce 
structural coverage measures. The structural coverage 
profiles of functionally generated acceptance tests 
and operational usage are used to examine two areas in 
software engineering: the examination of faults and 
the applicability of reliability models. 

This paper describes a study performed at NASA's Goddard Space 
Flight Center, Greenbelt, Maryland by researchers at the University of 
Maryland at College Park. A ten thousand line FORTRAN program was modi- 
fied to produce a structural coverage metric. After execution, the 
modified program produces a list of executed statements. The program 
was executed using both functionally generated acceptance tests and 
operational usage cases yielding structural coverage measures [CSC 78]. 

The program's software failures during maintenance were recorded. 

The study collected structural coverage data for both acceptance
test and operational usage and error data about faults revealed during
maintenance. Using these data, some simple questions can be answered
immediately. "How much of the code is executed by functionally gen-
erated acceptance testing (both by individual tests and by the entire
test suite)?" Individually, the test cases execute from 27% to 47% of
the executable statements. In total, 56% of executable statements are
executed. This percentage does not include statements executed in
either unit test or system test.

This research is funded by NASA grant NSG-5123.

"How many procedures are executed by functionally generated accep- 
tance test"? Anywhere from 48% to 69% for individual tests, for a total
of 75% of procedures.

More complicated questions compare acceptance test coverage to 
operational usage coverage. "Does acceptance test execute the same code 
as operational usage"? Yes, more or less. "Does operational usage 
exercise code not exercised by acceptance test"? Yes, about 8% of the
total executed code. The code executed by operational usage but not by 
acceptance test contained a mix of statement types different than accep- 
tance test alone. 
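
The comparison described above can be sketched with plain set operations. This is a minimal illustration, not the instrumentation used in the study: it assumes each run is available as a set of executed statement identifiers, and the function name and the toy data layout are hypothetical.

```python
def compare_coverage(acceptance_runs, operational_runs, total_statements):
    """Compare statement coverage of acceptance tests and operational usage.

    Each run is a set of executed statement ids (assumed representation).
    Percentages are relative to the total number of executable statements.
    """
    acceptance = set().union(*acceptance_runs)
    operational = set().union(*operational_runs)

    def pct(statements):
        return 100.0 * len(statements) / total_statements

    return {
        "acceptance_pct": pct(acceptance),
        "operational_pct": pct(operational),
        "both_pct": pct(acceptance & operational),
        # Code reached in operation that no acceptance test reached.
        "operational_only_pct": pct(operational - acceptance),
    }
```

A non-zero `operational_only_pct` marks the region where a fault could surface in operation without ever having been exercised by an acceptance test.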

There were eight faults revealed during maintenance. Each fault 
was contained in one procedure; one procedure contained two faults. 

There are not enough faults to reach any firm conclusions; however, I
feel there is enough information to inspire interesting questions. 

Are there faults revealed in maintenance in sections of code unexe-
cuted in acceptance test? No, although 8% of the code could contain
such a fault. If faults had occurred in the untested 8% then perhaps
the functional tests could be improved by structural coverage testing.

Since structural coverage testing would require executing every state- 
ment, it might have executed the code and revealed the fault. 

"Are faults more likely to be revealed in heavily executed pro- 
cedures?" Procedures were classified by the number of times they were
executed in operational usage. Half of the procedures were executed by
more than 90% of the operational usage cases. About half of the
revealed faults occurred in this group of procedures (3 of 8).


Information on each fault was collected using the SEL change report
form [SEL 82]. Faults are categorized by "time to isolate the error",
"time to understand and implement", and the section "type of
error".*

Time to isolate the change seems to be independent of procedure
coverage. Increased usage seems to be associated with a longer time to
understand and implement a change. This might be explained by suggest- 
ing that the lightly exercised procedures contain fairly simple code 
whereas the heavily exercised code is, by necessity, more complicated 
and requires more time to modify. There are too few faults to reveal 
any interesting patterns between fault types and procedure coverage in 
operational usage. 


References

[CSC 78] Computer Sciences Corporation, Acceptance Test Methods,
CSC/TM-78/6296, 1978.

[SEL 82] Guide to Data Collection, SEL-81-101, Software Engineering
Laboratory Series, Goddard Space Flight Center, Greenbelt, Mary-
land, August 1982.


* Time to isolate the error is classified as taking: less than one 
hour, one hour to one day, greater than one day, never found. Time to 
understand and implement the change is classified as taking: less than 
one hour, one hour to one day, one day to three days, or greater than 
three days. Faults are categorized as originating in the: requirements, 
functional specification, design (either involving data or expression), 
external environment, use of language, clerical or other. 
Statement Coverage
by 10 Acceptance Test Cases.
(Percentage of Maximum)


Case        Procs   Exec   Assign   Calls    Do      If    Reads   Writes

t1          50.0    27.5    31.1    27.5    34.4    34.1    17.6     6.3
t1a         48.5    24.9    28.3    18.2    33.1    32.7    17.6     6.3
t1b         44.1    21.2    23.9    20.1    23.6    27.0    17.6     4.9
t2          50.0    27.2    30.6    27.5    34.4    33.9    17.6     6.3
t2a         48.5    24.8    28.3    18.2    33.1    32.7    17.6     6.3
t2b         44.1    21.7    24.4    20.1    24.8    27.8    17.6     5.3
t3          48.5    24.4    27.8    18.4    32.5    32.0    17.6     5.8
t4          60.3    37.9    43.3    37.8    53.5    45.3    32.4    12.1
t4a         54.4    30.3    33.8    26.3    39.5    38.2    32.4    10.7
t4b         44.1    21.6    24.3    20.1    24.8    27.6    17.6     4.9
t4c         52.9    28.6    33.3    24.2    38.9    36.9    17.6     6.8
t4d         44.1    21.6    24.3    20.1    24.8    27.6    17.6     4.9
t5          69.1    47.1    52.6    55.7    54.8    55.0    41.2    12.6
t5a         64.7    39.0    43.9    38.5    45.2    48.9    32.4    10.2
t5b         67.6    41.5    45.7    51.7    48.4    49.8    26.5     7.8
t6          67.6    42.7    47.4    51.7    48.4    51.8    29.4    10.7
t6a         55.9    29.9    34.2    24.4    36.9    37.8    26.5     9.7
t6b         58.8    33.7    37.0    39.7    36.3    43.0    20.6     5.8
t7          66.2    39.0    43.8    40.4    44.6    48.7    26.5     9.7
t8          66.2    45.6    51.2    50.0    54.1    55.0    38.2    12.1
t9          66.2    41.0    46.0    42.3    46.5    50.9    35.3    11.7
t10         66.2    40.2    44.9    40.9    45.2    50.3    35.3    11.7

Union       75.0    56.0    63.5    68.4    68.8    65.1    41.2    14.6
Intersect   42.6    18.1    20.8    10.0    22.3    24.7    17.6     4.9
Statement Coverage
by 60 Operational Usage Cases.
(Percentage of Maximum)

Only two value columns survive on this page. They are listed here in
order as the Procs and Exec columns for cases 1 through 40; the Assign,
Calls, Do, If, Reads and Writes columns are not legible in this copy.

Case   Procs   Exec

 1     57.4    31.8
 2     63.2    39.8
 3     66.2    42.6
 4     54.4    29.3
 5     54.4    29.1
 6     52.9    25.5
 7     48.5    23.5
 8     57.4    31.6
 9     54.4    29.0
10     54.4    29.1
11     64.7    40.5
12     54.4    29.0
13     51.5    30.1
14     51.5    29.9
15     51.5    26.4
16     67.6    41.7
17     54.4    29.6
18     54.4    29.1
19     54.4    29.5
20     54.4    29.0
21     54.4    26.0
22     63.2    38.5
23     44.1    23.1
24     44.1    22.9
25     57.4    31.7
26     50.0    28.7
27     54.4    26.1
28     54.4    29.3
29     54.4    29.5
30     63.2    41.4
31     54.4    28.3
32     44.1    23.2
33     48.5    24.9
34     30.9    13.0
35     57.4    33.1
36     54.4    29.1
37     54.7    40.5
38     54.4    29.3
39     64.7    40.7
40     55.9    29.3
Statement Coverage
by 60 Operational Usage Cases.
(Percentage of Maximum)
(cont.)


Case        Procs   Exec   Assign   Calls    Do      If    Reads   Writes

41          57.4    30.0    34.1    24.2    36.9    38.0    35.3    11.2
42          52.9    31.4    37.2    20.8    45.2    43.3    26.5     8.7
43          54.4    29.0    33.1    20.1    35.7    36.5    41.2    11.2
44          66.2    40.4    44.8    41.1    45.2    50.7    44.1    13.1
45          66.2    46.6    51.9    51.0    54.8    57.8    47.1    13.6
46          64.7    39.2    43.8    38.8    45.2    49.3    41.2    11.7
47          57.4    30.0    34.2    24.2    36.9    38.0    35.3    11.2
48          66.2    39.1    43.7    40.7    44.6    49.1    35.3    11.2
49          66.2    45.8    51.1    50.2    54.8    55.4    47.1    13.6
50          66.2    41.2    45.9    42.6    46.5    51.3    44.1    13.1
51          57.4    31.1    34.0    30.4    34.4    42.1    29.4     7.8
52          54.4    29.6    34.0    20.6    36.9    37.2    41.2    11.2
53          50.0    2?.5    31.3    26.1    29.3    35.5    26.5     7.3
54          ?8.8    31.5    34.8    30.1    33.1    44.2    26.5     6.3
55          58.8    33.9    36.8    40.0    36.3    43.4    29.4     7.3
56          54.4    29.1    33.0    28.7    33.8    36.5    29.4     7.3
57          54.4    29.0    32.2    27.5    34.4    40.1    26.5     6.8
58          54.4    29.6    34.1    20.6    36.3    36.9    41.2    11.7
59          50.0    24.4    2?.6    17.2    31.8    32.1    26.5     7.3
60          29.4    12.3    14.6     4.5    15.3    14.1    23.5     5.3

UNION       80.9    64.1    71.9    78.2    76.4    77.2    55.9    17.5
INTERSECT   27.9    10.3    12.2     3.8    12.1    11.4    20.6     4.4
Figure: Time to Understand and Implement the Change vs
Number of Times Procedure was Exercised /
Total Operational Executions.
(Effort to Isolate the Cause in Parenthesis)

Vertical axis: 100% down to 10%, in steps of 10%.
Horizontal axis: time to understand and implement the change
(< 1 hour; 1 hour < 1 day; 1 day < 3 days).
Plotted labels: effort to isolate the cause for each fault.
Figure: Time to Isolate the Change vs
Number of Times Procedure was Exercised /
Total Operational Executions.
(Effort to Understand and Implement in Parenthesis)

Vertical axis: 100% down to 10%, in steps of 10%.
Horizontal axis: time to isolate the change (categories include
1 hour < 1 day and never found).
Plotted labels: effort to understand and implement each change.
Examining Functional Acceptance Testing
With Structural Coverage Metrics 

James Ramsey 


University Of Maryland 
At College Park 

November 1983 
Overview


Functionally generated acceptance tests are examined using 
structural coverage metrics. 

Reliability Models 
Software faults 

Management of acceptance testing process 
DEFINITIONS


Functionally generated acceptance test: 

derived from the program’s specifications 

Structural coverage metrics: 
procedure coverage 

How many procedures were executed? 
statement coverage 

How many statements were executed? 

Reliability Models 

Given a history of software failures, predict: 
mean time to next failure 
total number of faults in the program 


The programs:

Finished: 

A subset of a large satellite system 

FORTRAN 

68 procedures 

10k lines of source 

4.3k executable statements 

Ten acceptance tests 

not a rigorous sampling of the input domain 
but not trivial 
60 operational use cases 
Fault data for acceptance test and operation 

In progress: 

A whole satellite system: 

FORTRAN 

300 procedures 

50k lines of code 

20k executable statements 

Fault data for system test, acceptance test, and
operation
Structural Coverage of Acceptance Test


Executable Statement Coverage
by 10 Test Cases.


Case          Procedures      Executable        % Unique
              Executed (%)    Statements (%)    Code

t1               50.0            27.5             0.0
t2               50.0            27.2             0.0
t3               48.5            24.4             0.0
t4               60.3            37.9             4.4
t5               69.1            47.1             1.7
t6               67.6            42.7             0.6
t7               66.2            39.0             0.0
t8               66.2            45.6             1.0
t9               66.2            41.0             0.0
t10              66.2            40.2             0.0

Cumulative       75.0            56.0
Intersect        42.6            18.1



Note: 

44% of executable statements were not exercised in accep- 
tance test. They may have been executed in system / unit 
testing. 
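
The Cumulative, Intersect and % Unique Code figures in a table of this kind can be derived from per-test sets of executed statements. The sketch below is only an illustration under that assumption; the function name, the dictionary layout and the reading of "unique code" as statements reached by exactly one test are hypothetical and not taken from the study's tooling.

```python
def coverage_summary(runs, total_statements):
    """Summarize statement coverage for a group of test runs.

    runs maps a test name to the set of statement ids it executed
    (hypothetical data layout). All results are percentages of
    total_statements, rounded to one decimal place.
    """
    def pct(statements):
        return round(100.0 * len(statements) / total_statements, 1)

    executed_sets = list(runs.values())
    cumulative = set().union(*executed_sets)
    intersect = set.intersection(*executed_sets) if executed_sets else set()

    unique = {}
    for name, executed in runs.items():
        # Statements that no other test in the group reached.
        others = set().union(*(s for n, s in runs.items() if n != name))
        unique[name] = pct(executed - others)

    return {
        "per_test": {name: pct(s) for name, s in runs.items()},
        "unique": unique,
        "cumulative": pct(cumulative),
        "intersect": pct(intersect),
    }
```

A test whose unique percentage is 0.0 adds no statement coverage beyond the rest of the suite, although it may still exercise different data values.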
Structural coverage of 60 executions by users after accep-
tance test:


Structural Coverage of
60 Operational Usage Cases.

                Procedures      Executed
                Executed (%)    Statements (%)

Cumulative         80.9            64.9
Intersection       27.9            10.3


10% of the code was executed by ALL of the operational
cases.
Are the acceptance tests representative of operational usage?

This assumption MUST be true if using acceptance test 
failures to predict failures in operational usage. 

Are the acceptance tests representative of operational usage? 


Might not be valid to use reliability data gathered in accep- 
tance test to predict failures in operational use 

The “mix” of statements in the 8.4% differs from the “mix” 
of statements in the 55.7% 

twice as likely to execute a CALL or IF 

Otherwise, cannot distinguish acceptance tests from opera- 
tional usage cases by their structural coverage numbers 
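
One way to compare the "mix" of statement types in two regions of code is to compute the share of each type in each region and then take the ratio. The sketch below assumes a mapping from statement id to statement type is available; the names `statement_mix`, `relative_likelihood` and `type_of` are hypothetical.

```python
from collections import Counter


def statement_mix(statement_ids, type_of):
    """Return the share of each statement type within a set of statements.

    type_of maps a statement id to a type label such as "CALL" or "IF"
    (assumed representation).
    """
    counts = Counter(type_of[s] for s in statement_ids)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


def relative_likelihood(mix_a, mix_b):
    """Ratio of type shares in region A to region B, for types found in both."""
    return {kind: mix_a[kind] / mix_b[kind] for kind in mix_a if mix_b.get(kind)}
```

A ratio near 2.0 for a type would correspond to a statement of that type being about twice as likely in the first region as in the second.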



No faults were revealed in the 8.4%

If faults had been revealed in the 8.4%, then there was 
a flaw in the test plan 

chance to augment the tests 

chance to re-evaluate how tests are written 
Faults


8 faults revealed in operation 

all repaired by changing one procedure 
one procedure contained two faults 

How are these related? 

Time to isolate the fault 

Time to understand and implement the change 
Number of times the procedure is executed / 60 

Questions: 

Are faults more likely to be revealed in heavily exer- 
cised code? lightly exercised code? 

Are there relationships between time to isolate the 
fault and how thoroughly the procedure is exercised? 

Are “time to isolate” and “time to understand and 
implement” related? 




Are heavily exercised procedures more likely / less likely to 
contain a fault? Enticing but inconclusive with only 8 faults. 


Number of Times Procedure was Exercised /
Total Operational Executions


        Faults   Procedures

100%    * * *    PPPPP PPPPP PPPPP PPPPP PPPPP PP
 90%             PPP
 80%    * *      PPPP
 70%             P
 60%             PPPP
 50%    *        PPPP
 40%    *        PPPPP P
 30%             PPP
 20%
 10%    *        PPPPP PP
  0%             uuuuu uuuuu uuu

Half of the 55 procedures were executed by 90% or more of 
operational usage cases. 
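
A tally of this kind can be produced by counting, for each procedure, how many operational runs executed it, and then grouping procedures and faults into 10% bands. The sketch below is illustrative only: the banding rule (round the usage fraction up to the next 10%, with never-executed procedures in the 0% band) is an assumption, as are the function names and the data layout.

```python
from collections import Counter


def band_percent(times_executed, total_runs):
    """Round the usage fraction up to the next 10% band; 0 stays in band 0."""
    # Integer ceiling division avoids floating-point edge cases.
    return -(-times_executed * 10 // total_runs) * 10


def tally_by_band(runs, all_procedures, faulty_procedures):
    """Count procedures and faults per usage band.

    runs: list of sets of procedure names executed in each operational run.
    faulty_procedures: one entry per fault (a procedure may appear twice).
    """
    usage = {p: sum(p in run for run in runs) for p in all_procedures}
    procedures = Counter(band_percent(n, len(runs)) for n in usage.values())
    faults = Counter(band_percent(usage[p], len(runs)) for p in faulty_procedures)
    return procedures, faults
```

With only a handful of faults, as in the study, such a tally can suggest questions but cannot settle whether heavily exercised procedures are more fault-prone.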
Is there a relation between time to isolate the fault and how
well the procedure was exercised? 


Figure: Time to Isolate the Change vs
Number of Times Procedure was Exercised /
Total Operational Executions
(Plus Effort to Understand and Implement the Repair)

Horizontal axis: time to isolate the change
(< 1 hour; 1 hour < 1 day; > 1 day).
Vertical axis: 100% down to 10%, in steps of 10%.
Plotted labels: effort to understand and implement each repair
(minutes, hours, or days).


Time to isolate the fault is related to time to understand and 
implement the fix. 
Conclusions


Generated a method of comparing acceptance test and opera- 
tional usage 

Acceptance test is representative of operational usage except 
for the “mix” of statement types (at least in this study) 

Structural coverage metrics may provide insight into 
software faults 


Future Activities

The next study will attempt to reinforce the results of this 
study. 


More faults and fault data 

Larger, more representative NASA/SEL program 
Exact order of acceptance test 


An Error-Specific Approach to Testing

Peter M. Valdes (1)

Amrit L. Goel (2)
Syracuse University 

The main objective of software testing in the soft- 
ware development life cycle is to verify conformance of the 
implemented software with its intended requirements. Such 
requirements include 

1. System requirements 

2. Functional requirements 

3. Programming requirements 

Non-conformance with such requirements causes what are 
known as software errors. 

Specifying an appropriate testing strategy to
expose software errors is still an art. Traditional
approaches do succeed in revealing many errors but none
is powerful enough to expose all errors. The best that can
be hoped for is to use a specific test strategy to expose a
specific error type in specific program locations. It is
this limitation that we exploit to develop a new approach
to software testing which we call an error-specific testing
(EST) strategy. It is in fact a dual to the traditional
testing approaches.

(1) Research Assistant

(2) Professor of Electrical & Computer Engineering, Syracuse
University, Syracuse, NY 13210
The EST approach hypothesizes and tests on specific
error-types in specified program locations. When applied
to all error types of interest, it becomes powerful enough 
to satisfy the original objective of testing. 

In the presentation we give highlights of the EST 
approach. Then we show how such an approach can be used 
to expose errors in a simple program, triangle. The material 
presented here is not meant to be self-contained. Mathematical 
results and other features (positive and negative) of this 
testing strategy are discussed in technical reports available 
from the authors. Further work on the use of this approach 
for determining software reliability (a different definition 
than commonly used) is also in progress and will be published
in the near future. 
An Error-Specific Approach to Testing


Peter M. Valdes 
Amrit L. Goel 

Syracuse University 
Syracuse, N.Y 13210 
OUTLINE


1. Testing

2. Error-Specific Testing (EST)

3. Related Work

4. EST Methodology

5. Assumptions and Limitations

6. EST of Triangle

    6.1 Functional Requirements ($FR_i$'s) Decomposition

    6.2 Structural Parts ($SP_j$'s) Decomposition

    6.3 FR-SP Mapping

    6.4 Error Hypotheses

        Function-Based Errors (EF's)
        Structure-Based Errors (ES's)

    6.5 Test of Error Hypotheses

        Function-Based Error Testing Strategy
        Structure-Based Error Testing Strategy

    6.6 Recording Test Results in the
        FR-EF and SP-ES Matrices

Extensions of EST Philosophy
• The main objective of testing is to verify conformance
of the implemented software with its intended requirements
such as

    • System requirements

    • Functional requirements

    • Programming requirements

• Non-conformance with intended requirement is known as
a software-error.
Error-Specific Testing

• Traditional testing strategies can expose embedded
software errors but none is powerful enough to expose
all possible errors - therefore

• use a specific strategy to expose 
specific error type in specific 
program locations, i.e., Error Specific 
Testing (EST) 

• EST is really a dual approach to traditional testing. 
When applied to all possible hypothesized errors, it 
becomes powerful enough to satisfy the original ob- 
jective of software testing. 
Error-Specific Testing


• Focuses on specific error types in specific locations 

• Intuitively appealing and simple to use 

• Number of test cases is bounded 

• Can be automated 

• Permits trade-offs in allocation of resources 
               Traditional Software       Error-Specific Testing (EST)
               Testing

Specify        Testing strategy           Error-type in a specific
               or strategies              program location and an
                                          appropriate testing strategy

Expose         Different types of         Specific error-types in
               software errors in         the specified locations
               various program
               locations

Limitations    Not all possible           Only the specified error
               errors can be              (and some incidental errors)
               exposed                    is exposed. However, it can be
                                          used to expose all errors if
                                          all these errors are tested
                                          for existence using appropriate
                                          testing strategies.
RELATED WORK

Traditional

Use of non-error-specific test
strategies, e.g., path testing,
cause-effect graphing

Weyuker and Ostrand

Introduced error-based testing which uses all
available information in exposing certain types of errors.

Howden

Realized the limitations of traditional test
strategies but used them to expose certain types of
errors (weak mutation).

Clark, et al.

Used the notion of error-sensitive testing.
EST METHODOLOGY

1. Determine s/w functional requirements ($FR_i$'s).

2. Decompose code into structural parts ($SP_j$'s).

3. Hypothesize specific error types of interest for
   each $FR_i$ and $SP_j$.

4. Specify EST strategy for error types in (3).

5. Determine test requirements for each EST strategy.

6. Optimize test requirements.

7. Generate test cases from the optimized test requirements.

8. Execute test cases, debug exposed errors, retest the
   changed code including affected code.
ASSUMPTIONS/LIMITATIONS


• FUNCTIONAL REQUIREMENTS ARE CORRECT

• EST STRATEGY AVAILABLE FOR EACH HYPOTHESIZED
  ERROR TYPE

• NEED TO TEST FOR EACH HYPOTHESIZED ERROR TYPE
Error-Specific Testing of TRIANGLE


I. Functional Requirements Decomposition

        Description

FR1     IF $\neg(A \ge B \ge C)$ then not $\triangle$

FR2     IF $(A = B = C)$ then equilateral $\triangle$

FR3     IF $(A = B > C$ or $A > B = C)$ then isosceles $\triangle$
        but not equilateral $\triangle$

FR4     IF $(A > B > C$ and $A^2 = B^2 + C^2)$ then right $\triangle$

FR5     IF $(A > B > C$ and $A^2 > B^2 + C^2$
        and $A < B + C)$ then obtuse $\triangle$

FR6     IF $(A > B > C$ and $A^2 < B^2 + C^2)$ then acute $\triangle$
Code


Statement #   Label   Statement

     0                procedure TRIANGLE (A, B, C)
     1                if A ≥ B go to 1
     2                go to 2
     3          1     if B ≥ C go to 3
     4          2     Print ('Illegal Input') return
     5          3     if A = B go to 4
     6                if B = C go to 4
     7                A := A * A
     8                B := B * B
     9                C := C * C
    10                D := B + C
    11                if A ≠ D go to 5
    12                Print ('Right Δ') return
    13          5     if A < D go to 6
    14                Print ('Obtuse Δ') return
    15          6     Print ('Acute Δ') return
    16          4     if A = B go to 7
    17                go to 8
    18          7     if B = C go to 9
    19          8     Print ('Isosceles Δ') return
    20          9     Print ('Equilateral Δ') return
    21                end procedure
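
For readers who want to run the example, the listing can be transcribed into structured code. The sketch below is one possible Python rendering, not the authors' program: it returns the message instead of printing it, and it assumes the two ordering tests use "greater than or equal", which is needed for equilateral and isosceles inputs to reach their branches.

```python
def triangle(a, b, c):
    """Classify a triangle whose sides are given in non-increasing order.

    Follows the control flow of the TRIANGLE listing; the comparisons in
    the first test are assumed to be 'greater than or equal'.
    """
    if not (a >= b and b >= c):          # statements 1-4
        return "Illegal Input"
    if a == b or b == c:                 # statements 5-6, then 16-20
        if a == b and b == c:
            return "Equilateral"
        return "Isosceles"
    a2, b2, c2 = a * a, b * b, c * c     # statements 7-9
    d = b2 + c2                          # statement 10
    if a2 == d:                          # statements 11-12
        return "Right"
    if a2 < d:                           # statements 13 and 15
        return "Acute"
    return "Obtuse"                      # statement 14
```

Like the listing, this version never checks $A < B + C$, so an input such as (10, 2, 1) is reported as obtuse; a function-based test derived from FR5 is the kind of test that would expose that gap.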
Structural Part     Statement Number
                    (see TRIANGLE code)

SP1                 1
SP2                 2
SP3                 3
SP4                 4
SP5                 5
SP6                 6
SP7                 7, 8, 9, 10
SP8                 11
SP9                 12
SP10                13
SP11                14
SP12                15
SP13                16
SP14                17
SP15                18
SP16                19
SP17                20
Functional Requirement - Structural Parts Mapping

(Matrix with one column for each structural part, SP1 through SP17;
the entries are not legible in this copy.)
IV. Error Hypotheses

Function-based Errors (EF's)

EF1     Non-satisfaction of FR1 (catching an illegal ...)
EF2     Non-satisfaction of FR2
EF3     Non-satisfaction of FR3
EF4     Non-satisfaction of FR4
EF5     Non-satisfaction of FR5
EF6     Non-satisfaction of FR6


Structure-based Errors (ES's)

ES1.1, ES3.1, ES5.1, ES6.1, ES8.1,          Incorrect relational operator
ES13.1, ES15.1, ES10.1

ES1.2, ES2.1, ES5.2, ES6.2, ES8.2,          Incorrect transfer of control
ES13.2, ES14.1, ES15.2, ES10.2              flow

ES7.1                                       Incorrect Arithmetic Operator

ES7.2                                       Incorrect Arithmetic Expression
                                            (Formula)

ES7.3                                       Incorrect Assignment

Note for subscript notation: Left of dot gives the structural part
number where the error is possibly embedded. Right of dot gives the
error number for the given structural part.
V. TEST OF ERROR HYPOTHESES

Function-based Error Testing Strategy

• Assume functional requirements given as

    If (input conditions) then (output conditions)

• Generate test requirements for every valid and invalid
  combination of the inputs


FR1   Input condition:      $\neg(A \ge B \ge C)$
      Valid combinations:   $(A < B) \wedge (B > C)$
                            $(A > B) \wedge (B < C)$
                            $(A < B) \wedge (B < C)$
      Invalid combinations: $(A > B) \wedge (B > C)$
                            $(A = B) \wedge (B = C)$
                            $(A = B) \wedge (B > C)$
                            etc.

FR2   Input condition:      $(A = B = C)$
      Valid combinations:   $(A = B) \wedge (B = C)$
      Invalid combinations: $(A \ne B) \wedge (B = C)$
                            $(A = B) \wedge (B \ne C)$
                            $(A \ne B) \wedge (B \ne C)$

FR3   Input condition:      $(A = B > C$ or $A > B = C)$
      Valid combinations:   $(A = B) \wedge (B > C)$
                            $(A > B) \wedge (B = C)$
      Invalid combinations: $(A > B) \wedge (B \ne C)$
                            $(A \ne B) \wedge (B = C)$
                            etc.

FR4   Input condition:      $(A > B > C$ and $A^2 = B^2 + C^2)$
      Valid combinations:   $(A > B) \wedge (B > C) \wedge (A^2 = B^2 + C^2)$
      Invalid combinations: $(A > B) \wedge (B > C) \wedge (A^2 \ne B^2 + C^2)$
                            etc.

FR5   Input condition:      $(A > B > C$ and $A^2 > B^2 + C^2$ and $A < B + C)$
      Valid combinations:   $(A > B) \wedge (B > C) \wedge (A^2 > B^2 + C^2) \wedge (A < B + C)$
      Invalid combinations: $(A > B) \wedge (B > C) \wedge (A^2 > B^2 + C^2) \wedge (A > B + C)$
                            etc.

FR6   Input condition:      $(A > B > C$ and $A^2 < B^2 + C^2)$
      Valid combinations:   $(A > B) \wedge (B > C) \wedge (A^2 < B^2 + C^2)$
      Invalid combinations: $(A > B) \wedge (B > C) \wedge (A^2 > B^2 + C^2)$
                            etc.
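
The enumeration of valid and invalid input combinations can be mechanized for the ordering part of each condition. The sketch below is a simplified illustration rather than the authors' procedure: it enumerates only the nine combinations of (A relation B, B relation C), using one representative value per relation, and splits them by whether a given input condition holds. The function names and representative values are hypothetical.

```python
from itertools import product


def relation(x, y):
    """Return '<', '=' or '>' describing how x compares with y."""
    return "<" if x < y else "=" if x == y else ">"


def split_combinations(input_condition):
    """Split the nine (A rel B, B rel C) combinations by an input condition.

    B is fixed at 2 while A and C range over 1, 2, 3, so every pair of
    relations occurs exactly once.
    """
    valid, invalid = [], []
    b = 2
    for a, c in product((1, 2, 3), repeat=2):
        combo = (relation(a, b), relation(b, c))
        (valid if input_condition(a, b, c) else invalid).append(combo)
    return valid, invalid
```

For example, `split_combinations(lambda a, b, c: a == b == c)` yields the single valid combination `("=", "=")` for FR2 and the eight remaining combinations as invalid. Conditions involving the squared sides would need additional representative values.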
Structure-based Errors Testing Strategy:

Structural Part                          Testing Strategy


Incorrect Relational Operator
(SRE = simple relational expression)

SRE of the form $A < B$                  Test cases: $A = B$, $A > B$
SRE of the form $A \le B$                Test cases: $A < B$, $A = B$
SRE of the form $A = B$ or $A \ne B$     Test cases: $A = B$, $A \ne B$
SRE of the form $A \ge B$                Test cases: $A = B$, $A > B$
SRE of the form $A > B$                  Test cases: $A < B$, $A = B$


Incorrect Construct in a SRE
(where $k$ = constant)

SRE of the form $A < k$                  Test cases: $A = k$, $A^* < k$
                                         where $A^* = \max(\text{domain of } A)$
SRE of the form $A \le k$                Test cases: $A = k$, $A^* > k$
                                         where $A^* = \min(\text{domain of } A)$
SRE of the form $A = k$ or $A \ne k$     Test cases: $A = k$, $A \ne k$
SRE of the form $A \ge k$                Test cases: $A = k$, $A^* < k$
SRE of the form $A > k$                  Test cases: $A = k$, $A^* > k$


Incorrect Relational Operator and Constant

SRE of the form $A < k$                  Test cases: $A^* < k$, $A = k$, $A^* > k$,
                                         $(A < A^*) \wedge (A^* < k)$
SRE of the form $A \le k$                Test cases: $A^* < k$, $A = k$, $A^* > k$,
                                         $(A > A^*) \wedge (A^* > k)$
SRE of the form $A = k$                  Test cases: $A^* < k$, $A = k$, $A^* > k$,
                                         $(A < A^*) \wedge (A^* < k)$
SRE of the form $A \ge k$                Test cases: $A^* < k$, $A = k$, $A^* > k$,
                                         $(A < A^*) \wedge (A^* < k)$
SRE of the form $A > k$                  Test cases: $A^* < k$, $A = k$, $A^* > k$,
                                         $(A > A^*) \wedge (A^* > k)$
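
The idea behind testing for an incorrect relational operator can be checked mechanically: a set of test points is adequate for an intended operator if no other relational operator gives the same outcomes on all of those points. The sketch below illustrates that check under assumed names (`OPS`, `POINTS`, `distinguishes`); it considers only the six standard relational operators and does not model restrictions on which inputs can actually reach a given statement.

```python
import operator

# The six relational operators that could be substituted for one another.
OPS = {
    "<": operator.lt, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">=": operator.ge, ">": operator.gt,
}

# One representative (A, B) pair for each possible relationship.
POINTS = {"A<B": (1, 2), "A=B": (2, 2), "A>B": (3, 2)}


def signature(op, points):
    """Outcomes of `A op B` at each of the named test points."""
    return tuple(OPS[op](*POINTS[p]) for p in points)


def distinguishes(intended, points):
    """True if no other operator agrees with `intended` on every point."""
    target = signature(intended, points)
    return all(signature(op, points) != target for op in OPS if op != intended)
```

Under this model, the two points A = B and A > B are enough to single out `<` or `>=`, but not `==`, which they cannot tell apart from `<=`.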
Testing of TRIANGLE's ES's


Hypothesized Error                      Testing Strategy

ES1.2, ES2.1, ES5.2, ES6.2, ES8.2,      Simple traversal of go to
ES13.2, ES14.1, ES15.2, ES10.2          statement

ES1.1                                   Test cases: $(A = B)$, $(A > B)$
ES3.1                                   $(B = C)$, $(B > C)$
ES5.1                                   $(A = B)$, $(A \ne B)$
ES6.1                                   $(B = C)$, $(B \ne C)$
ES8.1                                   $(A^2 = B^2 + C^2)$, $(A^2 \ne B^2 + C^2)$
ES10.1                                  $(A^2 = B^2 + C^2)$, $(A^2 > B^2 + C^2)$
ES13.1                                  $(A = B)$, $(A > B)$
ES15.1                                  $(B = C)$, $(B > C)$

ES7.1                                   Simple traversal of statements
                                        7, 8, 9, 10
ES7.2                                   Simple traversal of statements
                                        7, 8, 9, 10
ES7.3                                   Simple traversal of statements
                                        7, 8, 9, 10
VI. Recording Test Results in the FR-EF and SP-ES Matrices


Let $M_{FR-EF}$ = element of FR-EF matrix
    $M_{SP-ES}$ = element of SP-ES matrix

Then, assuming we have a sufficient error-based strategy,

$M_{FR-EF}$ or $M_{SP-ES}$:

    If test result is negative
    If test result is positive


If error-based strategy is imperfect,

$M_{FR-EF}$ or $M_{SP-ES}$:

    If test result is negative but
    test's relative degree of
    imperfection is rdi

    If test result is positive
FR-EF Matrix

        EF1   EF2   EF3   EF4   EF5   EF6

FR1
FR2
FR3
FR4
FR5
FR6


SP-ES Matrix
Testing and Error Analysis of a Real-Time Controller

C. G. Savolaine

Bell Laboratories
Holmdel, New Jersey 07733


1. INTRODUCTION

This paper outlines inexpensive ways to organize and conduct system 
testing that were used on a real-time satellite network control 
system. This system contains roughly 50,000 lines of executable 
source code developed by a team of eight people. For a small 
investment of staff, the system was thoroughly tested, including 
automated regression testing, before field release. 

Detailed records were kept for fourteen months, during which 
several versions of the system were written. A separate testing 
group was not established, but testing itself was structured apart 
from the development process. The errors found during testing are 
examined by frequency per subsystem by size and complexity as well 
as by type. The code was released to the user in March, 1983. To 
date, only a few minor problems have been found with the system
during its pre-service testing and user acceptance has been good. 


2. THE SYSTEM BEING TESTED 

The Satellite Network Control System (SNCS) is a real-time, mini-
computer based, call-processing system developed for
Picturephone(R) Meeting Service (PMS). It controls the switching
of both 1.5 and 3.0 Mb/s digital circuits over a satellite using
Frequency Division Multiple Access (FDMA) technology. The SNCS
runs on a dedicated Western Electric 3B-20S computer (similar in
capacity to a DEC VAX 11/780) and supports interfaces to:

1. Earth stations 

2. A customer reservations system 

3. A satellite maintenance center 

4. A computer operator console 

Satellite connectivity requests are sent to the SNCS, which 
verifies these requests and assigns satellite transponder channels 
to each. Every 15 minutes commands are generated and sent to 
microprocessors located in the earth stations that tune the modems. 
The real-time control interface to the microprocessors is 

complicated by inter-dependencies among the commands across earth 
stations. To compensate, a sequencing is generated by the SNCS for 
the commands, which changes with every reconfiguration. The 

central SNCS multiplexes these earth station work lists and 
simultaneously distributes them to the stations, maintaining this 
sequencing.


3. TESTING METHODOLOGY


A prototype of the system was available in February, 1982. It
needed significant enhancement to provide full service, and it had
not been thoroughly tested. This section describes the methods used
to test the system while new versions were being developed
concurrently. The next section evaluates their usefulness.

The major techniques used were: 

• tester selected from the development team

• rotation of the testing assignment

• automated testing

• formal testing of all versions

• careful tracking of error causes and effort to correct

• deferring correction of low severity errors

• full regression testing

• releasing test cases to user with code

A person from the development team was assigned the full-time task
of creating and organizing test cases. The system was divided into
subsystems, and test cases were created consisting of multiple test
situations per case. Each test case had the objective of testing a
particular system feature. The running of all cases was automated,
with a difference program used on the output to isolate potential
errors. This made full regression testing possible. This testing
was done on each version even though only the last version was
released to the field.
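
The approach described above, running every case and comparing its output against a known good copy, might look like the following in a modern setting. This is a minimal sketch, not the tooling the SNCS team used: the function names and the in-memory case table are assumptions, and Python's standard `difflib` stands in for the difference program.

```python
import difflib


def compare_output(expected, actual):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="expected", tofile="actual", lineterm=""))


def run_regression(cases, run_case):
    """Run every test case and collect those whose output has changed.

    cases:    maps a case name to (input text, known-good output text)
    run_case: callable that turns input text into the system's output
    """
    failures = {}
    for name, (case_input, expected) in cases.items():
        diff = compare_output(expected, run_case(case_input))
        if diff:
            failures[name] = diff
    return failures


if __name__ == "__main__":
    # Toy stand-in for the system under test: it upper-cases its input.
    cases = {
        "feature_1": ("request a", "REQUEST A"),
        "feature_2": ("request b", "request b"),  # stale expectation
    }
    for name, diff in run_regression(cases, str.upper).items():
        print(name)
        print("\n".join(diff))
```

Only cases with a non-empty diff need human attention, which is what makes rerunning the full suite on every version practical.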

The testing assignment was rotated among the group, changing with
each of the three versions created. Tests were automated and
conducted by the tester, but problems, after being given a severity
code, were assigned to individuals in the development team. The
correction of bugs having a low severity was deferred to the next
version to avoid correcting multiple versions.

Each error was classified into one of three types: omission,
commission, or requirements. The errors due to a requirements
misunderstanding often stimulated additional documentation to
clarify the misconception. The system was divided into nine
subsystems, and each error was allocated to one of these. The
subsystems and their errors were then analyzed versus code size and
complexity as determined by the McCabe complexity measure.

For every error, the time needed to find the cause, to fix it, and
to test it was recorded. In addition, the number of iterations through
the cycle and whether other errors were caused or found in the
process were recorded.
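
The data collected for each error can be pictured as one record per error. The sketch below is illustrative only: the field names, the use of hours as the time unit, and the integer severity code are assumptions, not the layout of the project's actual tracking system.

```python
from collections import Counter
from dataclasses import dataclass

ERROR_TYPES = ("omission", "commission", "requirements")


@dataclass
class ErrorRecord:
    subsystem: int                 # one of the nine subsystems
    error_type: str                # one of ERROR_TYPES
    severity: int                  # severity code assigned by the tester
    hours_to_find: float           # time to find the cause (unit assumed)
    hours_to_fix: float
    hours_to_test: float
    iterations: int = 1            # passes through the find/fix/test cycle
    caused_other_errors: bool = False

    def __post_init__(self):
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type}")


def errors_by_subsystem(records):
    """Count errors per subsystem."""
    return Counter(r.subsystem for r in records)


def errors_by_type(records):
    """Count errors per type (omission, commission, requirements)."""
    return Counter(r.error_type for r in records)


def total_effort(records):
    """Total time spent finding, fixing, and testing all errors."""
    return sum(r.hours_to_find + r.hours_to_fix + r.hours_to_test
               for r in records)
```

Tallies such as `errors_by_subsystem` are the kind of summary that can then be set against each subsystem's size and complexity.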

Test cases were released to the user along with the code. This
provided a foundation for their testing efforts as well as serving
as detailed documentation. By running known good cases in the
user's environment, problems unique to their configuration could be
identified quickly.


4. RESULTS

By devoting 15% of the available resources to testing, an extensive
set of test cases was created and automated. By rotating the 
person responsible for testing, the testing process became more 
robust and independent of individual traits. It also made a task 
that was perceived to be onerous more palatable. The training 
investment in rotating the testing position was low, since each 
tester was previously in the development team. 

By automating the running and output comparison of the test cases, 
it was easy to run regression tests on a system that was growing 
and changing. The number of test cases grew steadily. Phased fixes 
were manageable because the less critical errors were the ones 
being deferred, and none were delayed more than a few months. 

The bugs correlated well with subsystem complexity and lines of 
code. Nearly half the errors were attributed to omission. Of 
these, half occurred in the two largest and most complex 
subsystems. About half the errors required only a one-line
correction. For the first version, the time to find an error and the
time to correct it were equal. For later versions, it took longer to
find the errors than to fix them.

Inviting the user to participate in generating and reviewing the 
test cases made it possible to gain early user involvement. 
Releasing the test cases with the code gave the user an extensive 
set of test cases upon which to build, and served as examples for 
user training. 

Thus, by formalizing and automating the testing process, a
thoroughly tested, stable system, together with its test cases, was
delivered to the user on schedule.


TESTING AND ERROR ANALYSIS
OF A REAL-TIME CONTROLLER

• System under test

• Testing methodology

• Data and Analysis

  - Error distribution

  - Error classification

• Conclusions


[Figure: SATELLITE NETWORK CONTROL SYSTEM. Block diagram showing the SNCS connected to the reservations system, the earth stations, the operator console, and the maintenance center.]


TESTING METHODOLOGY

• Development team personnel

• Full-time assignment

• Full regression testing

• Change management tracking system


[Figure: DEVELOPMENT/TESTING CYCLE. Timeline marked at February 1982, September 1982, December 1982, March 1983, June 1983, and the first quarter of 1984. Rows are labeled DEVELOPMENT (DEV 0 to DEV 4), INTERNAL TESTING (TEST 1 to TEST 3), PRESERVICE TEST, and USER TESTING (USER TEST 1, USER TEST 2), with the first and second field releases marked.]


ERROR SEVERITY

LEVEL   TYPE              # FOUND

  1     SYSTEM FATAL          5
  2     FUNCTION ERROR       29
  3     ANNOYING             44
  4     TRIVIAL              19

        TOTAL                91


[Figure: LINES OF CODE VERSUS MRS. Plot of lines of code per module (vertical axis, 0 to 20,000) against number of MRs (horizontal axis). One point was omitted from the regression analysis. The value of the correlation coefficient is not legible in the source.]


COMPLEXITY VERSUS ERROR TYPE

(O = omission, C = commission, R = requirements)

COMPLEXITY     INTERNAL TESTING      USER TESTING
MEASURE        ERROR TYPE:           ERROR TYPE:
                O    C    R           O    C    R

   880         10    5    0           1    4    4
   864         11    7    3           3    7    1
   403          2    1    6           -    -    -
   379          3    6    1           -    -    1
   369          1    1    1           -    -    -
   364          6    6    3           1    2    -
   277          2    3    1           -    2    -
   226          3    0    4           -    -    -
   107          3    1    2           -    -    -

TOTAL          41 + 29 + 21 = 91      5 + 15 + 6 = 26

Subtotal for the two most complex modules: 36 (internal testing), 20 (user testing)


COMPLEXITY VERSUS FATAL ERRORS

MODULE    COMPLEXITY MEASURE

  1             880
  2             864
  3             403
  4             379
  5             369
  6             364
  7             277
  8             226
  9             107

FATAL ERRORS (alignment with modules not recoverable from the source): 1, 2, 1, 1


COMPLEXITY VERSUS FATAL ERRORS

MODULE    COMPLEXITY MEASURE

  1             880
  2             864
  3             403
  4             379
  5             369
  6             364
  7             277
  8             226
  9             107

FATAL ERRORS (alignment with modules not recoverable from the source): 1, 1, [illegible], 2, 1, 1, 7


RESULTS

• Fatal errors occurred in less
  complex modules

• Non-fatal errors correlated well
  with complexity

• Most errors found in pre-field
  testing were omission type

• Most errors found in field testing
  were commission type


CONCLUSIONS

• Avoid complex modules

• In design phase, inspect for
  omission errors

• In internal testing, look for
  commission errors


TRANSFORMATIONS OF SOFTWARE DESIGN AND CODE
MAY LEAD TO REDUCED ERRORS


Edward M. Connelly

Performance Measurement Associates, Inc.
Vienna, Virginia 22180

ABSTRACT 


This research investigated the capability of programmers
and non-programmers to specify problem solutions by developing
example-solutions and, for the programmers, also by writing computer
programs; each method of specification was carried out at various
levels of problem complexity. The level of difficulty of each
problem was reflected by the number of steps needed by the user to
develop a solution. Machine processing of the user inputs
permitted inferences to be developed about the algorithms required to
solve a particular problem. The interactive feedback of processing
results led users to a more precise definition of the desired
solution.

Two participant groups (programmers and bookkeepers/
accountants) working with three levels of problem complexity and
three levels of processor complexity were used. The experimental
task employed in this study required specification of a logic for
solution of a Navy task force problem. This task involved choosing
ships from a ship list which identified the ship type, the transiting
time (the time required for the ship to get from its present position
to the desired site), and the stationing time (the number of days the
ship can remain on station with available provisions). In addition
to specifying ship combinations, the participants had to
specify by example-solution the range of transiting and stationing
times required. In another related experiment, participants
developed FORTRAN IV code to solve the same problems.

The performance of both programmers and non-programmers
was found to decrease with increasing levels of problem complexity
and with reduced processor support. For both groups, errors
of commission were relatively infrequent compared to errors of
omission. It was found that the degree of processor complexity
was much more influential than problem complexity in predicting 
performance scores. When little computer generalization of 
user input was provided, performance was significantly lower than 
during all other experimental conditions. Results also showed 
that participant-strategy in the generation of problem solutions 
was a significant factor in performance, though years of experience 
and years of education were not found to be good predictors of 
performance. The feedback aids were shown to be most effective 
when they included the logic implied by the example-solutions. 

These experiments demonstrate the effectiveness of the on-line 
use of computer software to create and modify software routines. 

Results also suggest that a measure for evaluating a programmer's
skill should involve evaluation of the procedures that programmers
use in developing example-solutions and in designing and writing
program code. Finally, the superiority of using example-solutions
with inductive feedback over writing code suggests that the trans- 
formation process provided by the induction might be applied anal- 
ogously to software development. Considering designs and code in 
multiple transformed forms may reduce software errors to a level 
found for example-solutions. 


INTRODUCTION 

Six experiments were conducted, with the same problems used in all
experiments. The ability of the participants to develop example-solutions
was evaluated as a function of the participant's background and experience,
the complexity of the problem to be solved, the level of processing
provided by the computer, and the level of feedback aids, when aids were available.

Experiments 1 and 2 were designed to investigate the ability of expert 
programmers and of bookkeepers/accountants who were not expert programmers 
to develop example-solutions for the hypothetical Navy task force problem. 

The experimental variables for both experiments were problem complexity
and processor complexity, i.e., the amount of machine processing of user
inputs.

Experiments 3 and 4 were designed to investigate the ability of expert
programmers and non-programmers to develop accurate and complete
example-solutions using various feedback aids at various levels of problem complexity.
The feedback aid designs were based on the results of Experiments 1 and 2,
where the systematic generation of example-solutions, as measured by a
combinational measure, had been shown to be highly correlated with
performance (explaining 63% of the score variance).

Experiment 5 was designed to investigate the capability of expert
programmers to revise problem-solution specifications in the form of
example-solutions in which various numbers of initially incorrect entries
had been introduced, using the feedback aids developed in Experiments
3 and 4.

Finally, Experiment 6 called upon expert programmers to develop
computer code written in FORTRAN IV for various levels of data input,
a design intended to be analogous to the design of Experiment 1. The results
of Experiment 6 were subroutines written in FORTRAN IV that should
accept or reject a ship combination according to whether that combination
was correct or incorrect.

The performance measures used in the experiments consisted of
error measures and strategy measures. The three error measures were:

a. the probability that a given ship combination was correctly
classified as acceptable or unacceptable;

b. the probability that a correct ship combination was accepted;

c. the probability that an incorrect ship combination was
rejected.
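
As a rough illustration of how these three probabilities could be estimated from a participant's results, the sketch below compares a participant's accept/reject decisions with the true status of each ship combination. The function name, the dictionary keys, and the data layout are assumptions made for this example; the paper does not describe the scoring code.

```python
def error_measures(truth, decisions):
    """Estimate the three error measures from a set of ship combinations.

    truth:     maps each combination to True if it is actually acceptable
    decisions: maps each combination to True if the participant's
               specification accepted it
    """
    correct = [c for c in truth if truth[c]]
    incorrect = [c for c in truth if not truth[c]]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect combination")
    return {
        # (a) fraction of all combinations classified correctly
        "p_classified": sum(decisions[c] == truth[c] for c in truth) / len(truth),
        # (b) fraction of correct combinations that were accepted
        "p_accept_correct": sum(decisions[c] for c in correct) / len(correct),
        # (c) fraction of incorrect combinations that were rejected
        "p_reject_incorrect": sum(not decisions[c] for c in incorrect) / len(incorrect),
    }
```

A low value for (b) points to errors of omission, while a low value for (c) points to errors of commission.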

In addition to the error measures above, relative error measures
were used. A relative error measure was defined as a participant's
error score (on any of the three measures above) on an experimental
problem minus his/her error score on the pretest problem. The relative
error measures thus tended to remove the effect of the participant's
innate capability and, as a result, were more sensitive to experiment
factors than were the error measures alone.

Two strategy measures were used to detect the frequency with which
participants used specific strategies. One strategy measure, the combinational
measure, detected the frequency with which a participant changed only one
component at a time between successive example-solutions. Another strategy
measure, a sequence measure, detected the use patterns of the various
feedback aids.
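
One plausible reading of the combinational measure is the fraction of successive example-solutions that differ from their predecessor in exactly one component. The sketch below implements that reading; the tuple layout of an example-solution is hypothetical, and the paper's exact formula may differ.

```python
def combinational_measure(example_solutions):
    """Fraction of successive pairs that differ in exactly one component.

    Each example-solution is a tuple of components, for instance
    (no. of CVA, no. of CVAN, no. of SS, transit days, stationing days).
    Returns 0.0 when fewer than two example-solutions were entered.
    """
    pairs = list(zip(example_solutions, example_solutions[1:]))
    if not pairs:
        return 0.0
    single_changes = sum(
        1 for previous, current in pairs
        if sum(a != b for a, b in zip(previous, current)) == 1
    )
    return single_changes / len(pairs)


if __name__ == "__main__":
    session = [
        (2, 0, 2, 5, 10),
        (2, 0, 2, 4, 10),   # one component changed
        (2, 0, 2, 4, 11),   # one component changed
        (1, 1, 2, 4, 11),   # two components changed
    ]
    print(combinational_measure(session))
```

A participant who works systematically, varying one thing at a time, scores close to 1.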

Results of Experiments 1, 2, and 3, in which programmers and
bookkeepers/accountants provided example-solutions, are compared with
the results of Experiment 6, where experienced programmers wrote
FORTRAN IV program code for the same problems. Results of the other
experiments can be found in Connelly (1982 a, b).

RESULTS OF EXPERIMENTS 1, 2, & 3
Processor Complexity and Error Reduction

First, as expected, more errors occurred during the work on the
more complex problems. However, the level of processing, or
generalization, of the example-solutions was found to be an important
error-reducing factor, i.e., a significant reduction in errors occurred
when data from example-solutions were processed into a standard form
and presented to the participant.

Systematic Strategies and Feedback-Aids 

A second result, and perhaps the most important, was that participants
in both categories who performed well tended to use a systematic,
step-by-step strategy in selecting example-solutions. This result, together with
the first, noted above, suggested that feedback aids might be designed to
encourage participants to use systematic strategies, by processing their
example-solutions and then feeding back the resultant data to suggest possible
additional inputs. A description of the aid designs and the results obtained in
using them is given in Connelly (1982 a, b).

Breadth vs. Depth of Experience

A third result of the first two experiments applied to the subsequent
experiments was that the number of years of advanced education (i.e., beyond
high school) and the number of years of professional experience were found
to be relatively unimportant factors in predicting performance.

The lack of a strong predictive relationship between years of higher 
education or years of experience and performance may come as a surprise 
to educators and directors of personnel departments. This result was found 
in all of the experiments, so that very strong evidence is available to support 
the assertion that years of education and relevant work experience are not 
good predictors of problem-solving performance. Additional results suggest
that the "number of programming languages (used on 1 or more programs)"
and "number of operating systems used" are better predictors of the
capabilities of computer users/programmers.

Low Frequency of Errors-of-Commission

The fourth result applied to the subsequent experiments was the
observation that only a few errors of commission occurred during the
generation of the example-solutions. The majority of errors that did occur
were errors of omission. This intriguing result influenced the design of
Experiment 6, where FORTRAN IV code was written to solve the same
problems used in Experiment 1, so that a comparison of error rates would
be possible.


RESULTS FOR EXPERIMENT 6 

Two types of errors were analyzed. One type, termed an "error
of omission", referred to an error that resulted in a failure to accept a
correct entity (e.g., ship combination). When specifying a problem solution
with example-solutions, an error of omission could be directly traced
to a failure to enter an example of a suitable entity (ship combination).
The second type of error considered was an "error of commission." When
example-solutions were used to specify a problem solution, an error of
commission corresponded to an incorrect example entered into the processor
which was then treated by the processor as a correct example. An error
of commission resulted in erroneously accepting incorrect entities (ship
combinations).
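
In set terms, the two error types defined above are the two ways an accepted set can disagree with the correct set. The following sketch shows this with hypothetical combination labels; it illustrates the definitions and is not the scoring procedure used in the experiments.

```python
def classify_errors(correct, accepted):
    """Split disagreements into errors of omission and of commission.

    correct:  set of entities (ship combinations) that should be accepted
    accepted: set of entities the specification actually accepts
    """
    omission = correct - accepted      # correct entities that were not accepted
    commission = accepted - correct    # incorrect entities that were accepted
    return omission, commission


if __name__ == "__main__":
    # Labels are hypothetical shorthand for ship combinations.
    correct = {"2CVA+2SS", "2CVAN+2SS", "1CVA+1CVAN+2SS"}
    accepted = {"2CVA+2SS", "2CVAN+2SS", "3CVA+2SS"}
    omission, commission = classify_errors(correct, accepted)
    print("omission:", omission)
    print("commission:", commission)
```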

Errors-of-Omission

There was little difference in the effect of problem complexity on
errors of omission between the two methods of specifying problem solutions,
i.e., by example-solution or by FORTRAN IV subroutines.

Errors-of-Commission

When generating example-solutions without feedback aids, the rate
of errors of commission increased sharply at a problem complexity level
near 20,821, as measured by Halstead's E metric (Connelly, Comeau, &
Johnson 1981). But, given a suitable feedback aid environment, such as in
Experiment 3, this problem complexity limitation could be eliminated, as
evidenced by the Experiment 3 data, in which performance degradation did
not appear.

The most important result regarding errors of commission was
that specification by example-solutions was superior to specification by
program code. Analysis of the mean scores from Experiments 1, 2, and 3
provided strong evidence that using example-solutions substantially reduced
errors of commission compared to using FORTRAN IV program code. The
3% rate for errors of commission with example-solutions compared favorably
with 18% for program code.

Three hypotheses concerning the superior performance of the
example-solution method seem plausible:

1. It was working with examples and dealing with each individual 
combination of items one-at-a-time that resulted in a low rate 
of errors of commission. 

2. It was the specification of each combination one-at-a-time that
alone was important. Consequently, if computer programs
were developed to specify each solution combination
one-at-a-time, the rate of errors of commission would be low.

3. The success of the example-solution method was due, in part, 
to the transformation of example-solutions from one logic form 
into another, such as the ship selection logic (SSL), or into 
several different forms, such as the feedback aids. Thus, it 
was the transformation of logic which enabled the user to view 
the problem in more than one way and that resulted in a low 
rate of errors of commission. Consequently, if program code 
entered by the user were transformed into a different logic form 
and fed back to the user for approval, a low rate of errors of 
commission would be obtained. 

These hypotheses are not mutually exclusive; all could be true.
We have strong evidence that the first hypothesis is true. If the second is
true but not the third, program design and coding methods could be adapted
to a more combination-dependent structure. And finally, if the third
hypothesis were found to be true, pre-compilation aids could be designed to
convert the user's program code into another form (while maintaining the
same program logic) for feedback to the user.
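
To make the third hypothesis concrete, the sketch below shows one very simple kind of transformation: a user-written acceptance rule is evaluated over every combination in a small domain, and the accepted combinations are listed back for review. This is only an illustration of the idea under stated assumptions. The function, the domain, and the deliberately flawed rule are all hypothetical, and the paper does not propose this particular aid.

```python
from itertools import product


def accepted_combinations(predicate, domains):
    """List every combination in the domain that the predicate accepts.

    predicate: the user's rule, called with one keyword argument per variable
    domains:   maps each variable name to the values it may take
    """
    names = list(domains)
    rows = []
    for values in product(*(domains[name] for name in names)):
        combination = dict(zip(names, values))
        if predicate(**combination):
            rows.append(combination)
    return rows


if __name__ == "__main__":
    # A hypothetical rule with an error of commission: it asks for
    # "at least 2" carriers where "exactly 2" was intended.
    def user_rule(cva, cvan, ss):
        return cva + cvan >= 2 and ss == 2

    domains = {"cva": range(3), "cvan": range(3), "ss": range(3)}
    for row in accepted_combinations(user_rule, domains):
        print(row)
```

Seeing three- and four-carrier task forces in the listing gives the user a chance to notice the error, which is the kind of second view of the logic that the hypothesis credits for the low commission rate.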



CONCLUSIONS 


1. The lack of a strong relationship between "years of higher
education", "years of experience" and performance, coupled with
the strong relationship between performance and the "number of
computer languages" known and "number of operating systems" used,
suggests that education and experience should not be used as they have
been in the past for hiring, promoting, determining salary
level, and assigning tasks. Instead, the number of computer languages
known and the number of operating systems used, which are better
performance predictors, should be used until the underlying factors
included in each are discovered.

2. Apparently, the depth of an individual's experience is not as
important to performance as is the breadth of that experience.

3. A possible common underlying experience related factor is the 
ability to view problems from alternative viewpoints, or the 
ability to develop alternative approaches to problems - an 
ability that might be enhanced with feedback aids. 

4. The performance prediction capability of strategy measures, 
developed as moment-to-moment measures, not only clearly 
demonstrates that systematic strategies were used by successful 
participants (which led to the design of the feedback aids), but 
also convincingly demonstrates that moment-to-moment measures 
provide the sensitivity to explain considerable performance variance
(approximately 60% in Experiments 1 through 4).

5. The superior performance (fewer errors of commission) achieved
when using example-solutions and inductive processing to specify
problem solutions over the performance achieved when using
FORTRAN IV code may provide a basis for determining the
underlying mechanism for that success and a means for
incorporating that mechanism into program designing and coding aids.
Apparently, superior performance was obtained because
each combination of the input variables was treated individually
and/or because the example-solutions were transformed into
another logic form, the ship selection logic (SSL). If the former
is a significant factor, then aids described in this report should be
adapted to program designing and coding aids. If the latter
is a significant factor, then designing and coding aids should
be developed to transform the logic provided by the user into
another form which is then fed back to the user for his review.
Such a transformation might present the program's equivalent
logic.


RESEARCH METHOD

TEST ABILITY OF INDIVIDUALS
TO SPECIFY PROBLEM SOLUTIONS:

• EXAMPLE SOLUTIONS

• FORTRAN IV CODE


SIX EXPERIMENTS

ORIGINAL EXAMPLE SOLUTIONS

1. PC/IR, PROGRAMMERS

2. PC/IR, BOOKKEEPERS

3. PC/FA, PROGRAMMERS

4. PC/FA, BOOKKEEPERS

REVISE EXAMPLE SOLUTIONS

5. PC/FA, PROGRAMMERS

FORTRAN IV CODE

6. PC/IR, PROGRAMMERS

PC = PROBLEM COMPLEXITY
IR = INFORMATION REQUIRED
FA = FEEDBACK AIDS


PROBLEM STATEMENT

1. THE SHIPS NEEDED FOR THE TASK 
FORCE ARE: 

• 2 AIRCRAFT CARRIERS — NUCLEAR 
(CVAN) OR NON-NUCLEAR (CVA) 

AND 

• 2 SUBMARINES (SS) 

2. THE TRANSITING TIME MUST BE 5 
DAYS OR LESS AND 

3. THE STATIONING TIME MUST BE 10 
DAYS OR MORE. 


THESE TASK FORCE CRITERIA SPECIFY THREE
COMBINATIONS OF SHIP TYPES AS FOLLOWS:

• 2 CVA AND 2 SS

OR

• 2 CVAN AND 2 SS

OR

• 1 CVA AND 1 CVAN AND 2 SS
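
For comparison with the FORTRAN IV subroutines written in Experiment 6, the acceptance rule in the problem statement can be sketched in a few lines of Python. The class and function names are hypothetical, and the sketch assumes that a task force consists of exactly the four required ships and that the time limits apply to every ship; the slides do not spell out either point.

```python
from collections import Counter
from dataclasses import dataclass

CARRIER_TYPES = {"CVA", "CVAN"}


@dataclass(frozen=True)
class Ship:
    ship_type: str          # e.g. "CVA", "CVAN", "SS"
    transit_days: int       # time to reach the desired site
    stationing_days: int    # days the ship can remain on station


def acceptable(ships):
    """Return True if the ships form an acceptable task force.

    Assumes exactly 2 carriers (CVA or CVAN) plus 2 submarines (SS) and
    nothing else, each with transit time of 5 days or less and stationing
    time of 10 days or more.
    """
    counts = Counter(ship.ship_type for ship in ships)
    carriers = sum(counts[t] for t in CARRIER_TYPES)
    right_types = carriers == 2 and counts["SS"] == 2 and len(ships) == 4
    right_times = all(
        ship.transit_days <= 5 and ship.stationing_days >= 10
        for ship in ships
    )
    return right_types and right_times
```

Writing `carriers >= 2` instead of `carriers == 2` in such a rule would be an error of commission in the paper's terms, since it would accept task forces that should be rejected.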


SHIP SELECTION LOGIC (SSL)


             No. of       Transit Time     Stationing Time
Ship Type    Ship Type    MIN     MAX      MIN     MAX

CVAN            0
CVA             1          1       5       10      50
CA              0
CGN             0
CG              0
DD              0
SSN             0
SS              2          1       5       10      50
AO              0

TOTAL:          3


DEMOGRAPHICS


• YEARS OF EXPERIENCE AND YEARS OF
HIGHER EDUCATION ARE NOT IMPORTANT TO
PREDICTING PERFORMANCE.

• NUMBER OF COMPUTER LANGUAGES
KNOWN AND NUMBER OF OPERATING
SYSTEMS USED ARE IMPORTANT TO
PREDICTING PERFORMANCE.

• UNDERLYING FACTOR MAY BE ABILITY
TO VIEW PROBLEMS FROM ALTERNATIVE
VIEWPOINTS.


EXAMPLE SOLUTIONS/
FORTRAN IV CODE

• EXAMPLE SOLUTIONS AND FEEDBACK
AIDS YIELD SAME ERROR
OF OMISSION RATE AS FORTRAN
IV PROGRAMS

• EXAMPLE SOLUTIONS AND FEEDBACK
AIDS YIELD MUCH LOWER
RATE OF ERROR OF COMMISSION
THAN FORTRAN IV PROGRAMS


ERRORS OF COMMISSION

EXAMPLE SOLUTIONS PLUS
INDUCTIVE FEEDBACK            3%

PROGRAM CODE               17.7%


HYPOTHESES 

SUPERIOR PERFORMANCE OBTAINED 
WITH EXAMPLE SOLUTIONS MAY 
BE DUE TO: 

• WORKING WITH EXAMPLES 

OR 

• WORKING WITH EACH SOLUTION 
ONE-AT-A-TIME 

OR 

• THE TRANSFORMATION FROM ONE 
FORM TO ANOTHER (EXAMPLES TO 
EQUIVALENT LOGIC) 


EXTENDED ABSTRACT

“You can observe a lot by just watching”*

How Designers Design 2

David Littman*, Kate Ehrlich*, Elliot Soloway*, John Black**

Department of Computer Science* Department of Psychology**

Yale University 

New Haven, Connecticut 06520 
(Please address all correspondence to Elliot Soloway) 

1. Introduction: Motivation and Goals 

Rather than developing design languages and support environments on the basis of what we
think designers should be doing, we felt that a more informed process would be to first find out
what they actually do. To this end, we interviewed 4 expert software designers and
2 novice designers for two hours each as they designed an electronic mail system; subjects were
encouraged to talk aloud as they worked, and the design session was video-taped. Here we briefly
summarize the key observations based on an analysis of these tapes.


2. Subjects and Task 

All designers were professionals supplied to us by a nearby branch of ITT. Expert designers 
had at least 8 years of design experience in commercial settings, while novices had less than 2 
years of similar experience. Note, however, that the novices were without question bright, 
competent individuals; they simply had less experience than the experts. Subjects were given the 
following task: 

TASK -- Design an electronic mail system around the following primitives: READ, REPLY, 
SEND, DELETE, SAVE, EDIT, LIST-HEADERS. The goal is to get to the level of pseudocode 
that could be used by professional programmers to produce a running program. The mail system 
will run on a very large, fast machine so hardware considerations are not an issue. 


*Quote from Yogi Berra, a catcher for the New York Yankees. 

2 This work was sponsored by a grant from ITT.



3. Observation I: How the Design Progressed 

All our expert designers considered the same topics, almost always in the same order, and 
usually at the same level of detail. This surprisingly consistent observation led us to posit the
concept of a session meta-plan, which we believe guided the expert software designer's treatment 
of the electronic mail system. Novices did not seem to use anything analogous to a common plan 
of attack on the mail system problem: their design sessions were less systematic than those of 
the experts. 

As illustrated in the time line shown below, the meta-plan of our experts contained five distinct
phases: first the experts described how a user would view the mail system; then they stated
various assumptions (e.g., we will use dumb terminals); then experts used models of mail systems
at various levels of generality (e.g., at the most general level was the flow of information model,
followed by examples of other mail systems they have known, followed by the specific system at
hand); finally, the experts worked on the concrete design. Notice that the novices dove right into
the detailed specifications of the system. We asked all subjects to provide a wrap-up evaluation
at about 10 minutes before the end of the session.

          Start                                                   Finish

NOVICES:  concrete design ....................................... wrap-up

          ~100 mins  ~10 mins

EXPERTS:  user .... assump- .... abstract .... concrete .... wrap-up

          model     tions        models of     design

                                 mail system

          ~10 mins  ~10 mins  ~80 mins  ~10 mins

The following quotes taken from the protocols are representative and support the above claims: 

At 3 minutes into the task, one novice said: 

(Writes SAVE) “To save I have to open a file and then write to that file... If I have 5 or 6
messages I have to consider if I want to save all of them or whether I should save a specific one
and specify which one I am saving.”

Similarly, at 3 minutes into the task, one expert said: 

“I guess I have to establish a set of assumptions of my own” 

At 10 minutes into the task, one novice said: 

“The number of the message line has to be specified... In order to get the message... if I have 4
messages, I need to know which lines I’m going to take if the user only wants to save one
message...”

Similarly, at 10 minutes into the task, one expert said:

“Let me start looking at the states of a user, first of all, and what the world is going to seem
from the view of a user.”


4. Observation II: Design Strategies 

We observed that experts all employed the following four general design strategies — and the 
novices did not. 

1. The experts were purposeful: experts continually stated explicit goals and subgoals,
and continually checked to see how their design satisfied those goals. For example,
one expert said:

“I want to go backwards for a minute. I want to think about how I got to here and
from here to there and how I’m now going to go back to the user. OK I’ve got it.”

In contrast, novices operated in a more bottom-up fashion: they pursued goals as
problems came up, without a global sense of where they were going.

2. The experts were model-directed: experts drew on their experience and continually
manipulated models of the mail system at various levels of abstraction, e.g., at the 
most abstract level, one expert viewed mail as a stream of incoming data that 
needed to be routed to the appropriate place. These models were used to set up goals 
to be pursued. 

3. The experts always followed a course of balanced development: components of the
system were designed in a breadth-wise fashion: at each level, the detail of each 
component was about the same. For example, one expert said: 

Subject: “So I’m trying to keep all the things level.”

Interviewer: “Knowing a little bit about each one.”

Subject: “Knowing a little bit about each one. The same level of complexity with
each one and hopefully the questions may have... and as you’ve seen before
sometimes when I ask a question about one thing it reminds me of another thing I
had passed over before and if I’m at the same level of decomposition I can see some
links between them.”

In contrast, novices plunged into the details of a specific component only to find 
when they came to the next component that assumptions and constraints of the 
earlier component were violated — and thus bugs were introduced. 

4. The experts employed a variety of notes that they used during the design: 

• Assumptions: these notes set out the parameters of the system; they were 
typically specified early in the design, e.g., we will be using dumb terminals. 

• Constraints: as components were being defined certain properties that would 
have a global effect needed to be noted, e.g., in working on the REPLY 
command a constraint was set up that the buffer pointer to the current 
message should not be updated by the READ command. 

• Expectations: these notes set up demons that would interrupt the designer at 
key points in the process, e.g., in reviewing the LIST-HEADERS command, the 
designer realized that the data structure for the mail messages had better permit 
access to the subject field, as well as the contents field. 

The notes were used by the experts to continually monitor and evaluate the progress 
of the design. 

5. Implications for the Software Aids 

What are the implications of these observations for design languages and support software? 
Where did the experts need assistance? It is clear to us that information management was a key 
skill that experts had, but one with which they could use some assistance, especially when the 
complexity of the task grows large. However, the type of information management that we think 
designers need is not simple “version management”; this type of assistance merely regurgitates 
back to the user exactly what he/she has typed in. Rather, the software aids that we see as 
relevant to enhancing the design process are those that can digest the information provided by 
the designer. In particular, designers seemed to need assistance in 
keeping track of the “notes” they made (the assumptions, expectations, and constraints) and 
recalling them at just the appropriate time. Software that could provide this type of assistance 
would require considerable understanding of the design process itself, as well as information that is 
problem specific. 

For example, in designing an electronic mail system, assume the designer noted the following 
assumption to the software aid: 

Assumption: use only dumb terminals 
Reason: keep costs down 

Then later when the designer was working on, say, the SEND command, and contemplating how 
a message could be edited, the software aid should respond with: 

Careful: you assumed that dumb terminals would be used; this type of terminal 
does not have local editing capability 

Reminders of this type would be a powerful aid to an expert. Moreover, they 
might help a novice designer learn good habits, by encouraging him/her to carry out the design 
using notes about assumptions, expectations, and constraints. 
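
The reminding behavior described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the class names (Note, DesignNotebook), the idea of attaching trigger keywords to a note, and the keyword matching itself are all hypothetical and are not taken from the study. A real aid of the kind proposed here would need much deeper knowledge of the design process than keyword lookup provides.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    kind: str                           # "assumption", "constraint", or "expectation"
    text: str
    reason: str = ""
    triggers: frozenset = frozenset()   # hypothetical keywords that should recall the note


@dataclass
class DesignNotebook:
    notes: list = field(default_factory=list)

    def record(self, kind, text, reason="", triggers=()):
        """Store a note together with the keywords that should bring it back."""
        lowered = frozenset(t.lower() for t in triggers)
        self.notes.append(Note(kind, text, reason, lowered))

    def remind(self, activity):
        """Return a reminder for every note whose trigger appears in the activity."""
        words = set(activity.lower().split())
        reminders = []
        for note in self.notes:
            if note.triggers & words:
                message = f"Careful: you noted the {note.kind} '{note.text}'"
                if note.reason:
                    message += f" (reason: {note.reason})"
                reminders.append(message)
        return reminders


notebook = DesignNotebook()
notebook.record("assumption", "use only dumb terminals",
                reason="keep costs down",
                triggers=["edit", "editing", "display"])

reminders = notebook.remind("SEND command: editing a message before it is sent")
for line in reminders:
    print(line)
```

The interesting design question, which this sketch sidesteps, is how the aid decides that a note is relevant; here the designer supplies the triggers by hand.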

6. Concluding Remarks 

The verbal protocols we collected and analyzed from our subjects provide a tantalizing glimpse 
into the process of design. While even in this small pilot study we saw clear convergence of 
techniques among the experts, and clear differences between the novices and the experts, we see 
the observations made in this paper as only a beginning. We feel strongly that studies of the 
type reported here are necessary in order to get a better understanding of design, which in turn 
can knowledgeably inform the development of design aids. Yes, Yogi, you can observe a lot by 
just watching! 
EVALUATING MULTIPLE COORDINATED WINDOWS 

FOR PROGRAMMER WORKSTATIONS 


Ben Shneiderman*, Charles Grantham, Kent Norman#, Judd Rogers, 
and Nicholas Roussopoulos* 

*Department of Computer Science 
#Department of Psychology 

Human-Computer Interaction Laboratory 
University of Maryland 
College Park, MD 20742 


October 14, 1983 


ABSTRACT: Programmers might benefit from larger screens with 
multiple windows or multiple screens, especially if convenient 
coordination among screens can be arranged. This research 
project explores uses for multiple coordinated displays in a 
programmer's workstation. Initial efforts focus on the potential 
applications, a command language for coordinating the displays, 
and the psychological basis for effective utilization so as to 
avoid information overload. Subsequent efforts will be devoted 
to implementing the concepts and performing controlled 
psychologically oriented experiments to validate the hypotheses. 


INTRODUCTION 

Full screen display editors are rapidly replacing line oriented 
editors, because they offer a larger window and more intuitively 
clear operations. Comparative studies indicate display editors 
can be learned in half the time and permit twice the productivity 
for many tasks (Roberts, 1979). 

Similar productivity gains may be possible by further expanding 
the personal workstation to include multiple coordinated windows. 
Multiple windows have been used in graphics systems where one 
screen provides command facilities for the graphic display. 
Applications with complex information display requirements often 
employ multiple computer displays, e.g. nuclear reactor control, 
air traffic control, manufacturing control, spacecraft control, 
and commodity exchanges. 

Multiple display research in programmer workstations has been 
conducted by the Japanese (Mano et al., 1982), in the Spatial 
Data Management project at Computer Corporation of America 
(Herot, 1980), and by Xerox with their overlapping windows 
strategy (Smith et al., 1982). This latter approach, often 
called the "cluttered desk model", allows the user to create 
multiple windows in which independent processes can be initiated. 
Other researchers are developing the software architectures 
necessary to support multiple window activity (Gonzalez, 1982; 
O'Hara, 1983; Weiser et al., 1983). 

Presented at the EIGHTH ANNUAL SOFTWARE ENGINEERING WORKSHOP, NASA, Goddard 
Space Flight Center, 11/30/83 

Larger displays and multiple windows are attractive (IBM, 1983), 
but they can overwhelm users with too much information and 
frustrate them with the many commands needed to accomplish their 
tasks. 


RESEARCH DIRECTION 

In this project we propose to go beyond these early efforts and 
evaluate a multiple window environment in which the activities 
across windows can be coordinated to support programming tasks. 
Appealing applications include: 

1) A central window shows program text, while the left window 
shows input test cases and the right window shows output 
results. Each press of a function key moves the left window 
to the next test case and the right window to the next output 
case. The programmer can then examine the code and verify the 
correctness of the output or track anomalies. 

2) One window shows program text and as the cursor is moved 
onto a variable, the declaration, recent values, and 
cross-reference list automatically appear in another window. 

3) One window shows the module design specification, another 
window shows the flowchart, and the third window shows the 
program code under development. As the user enters the name 
of another module, the specification, flowchart, and code 
appear simultaneously. 

4) The top-down structure chart appears in one window, and as 
the user moves the cursor onto one of the boxes, the code 
and/or specifications appear in other windows. 

5) Three windows show a contiguous section of a program 120 
lines long, 40 lines per window. The command DOWN 25 causes 
all three screens to move down 25 lines. 

6) With a single command the user can display all three (or 
more) modules invoked by a higher-level module, to check for 
commonality of argument passing strategies. 

The list could be made much longer, but these examples convey the 
rich potential for multiple windows, if useful coordination and 
synchronization can be achieved conveniently. Multiple screens 
are advantageous for situations which require correlation between 
two segments of text, fuller context for comprehension of local 
code, and concurrent viewing of the root, sub-tree, and leaves of 
a tree structure. 
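
As an illustration of application 5 above (three windows showing 120 contiguous lines, all moved by one DOWN 25 command), here is a minimal Python sketch. The class name LinkedWindows and its methods are hypothetical and are not part of the proposed system; the sketch only shows how a single offset shared by all windows keeps them synchronized.

```python
class LinkedWindows:
    """Several windows that together show one contiguous slice of a file."""

    def __init__(self, lines, n_windows=3, height=40):
        self.lines = lines
        self.n_windows = n_windows
        self.height = height
        self.top = 0  # zero-based index of the first line in the first window

    def down(self, count):
        """Move every window down by the same number of lines."""
        last_top = max(0, len(self.lines) - self.n_windows * self.height)
        self.top = min(self.top + count, last_top)

    def ranges(self):
        """Return the 1-based (first, last) line numbers shown in each window."""
        shown = []
        for w in range(self.n_windows):
            first = self.top + w * self.height
            last = min(first + self.height, len(self.lines))
            shown.append((first + 1, last))
        return shown


program = [f"line {i}" for i in range(1, 301)]
view = LinkedWindows(program)
view.down(25)
print(view.ranges())   # [(26, 65), (66, 105), (106, 145)]
```

Because only one offset is stored, the windows cannot drift apart; independent windows would need an explicit coordination step after each command.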






We are in the process of designing a language to specify 
window coordination. Our initial approach is to use text editor 
macros to create a set of commands which would fit in the editor 
environment. For example, the macro nc (for Next Case) might be 
specified as 1>L /**/; 3>L /CASE/ which means locate on screen 1 
the string ** (a marker for the beginning of an input test case) 
and simultaneously locate on screen 3 the string CASE (a header 
field for each output case). Conjointly, we will study 
programming behavior to isolate those tasks which can benefit 
from the use of multi-screen information presentation 
strategies. 
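
To make the nc macro concrete, the following Python sketch parses a macro body of the form 1>L /**/; 3>L /CASE/ and applies each locate step to the named screen. Everything beyond that one example is an assumption: the grammar (screen number, then >L, then a string between slashes, with steps separated by semicolons), the Screen class, and the rule that a locate searches forward from the current position are hypothetical. The steps also run one after another here, whereas the text describes them as simultaneous.

```python
import re

# Assumed step syntax: <screen number> > L /<string>/
STEP = re.compile(r"^\s*(\d+)\s*>\s*L\s*/(.*)/\s*$")


def parse_macro(body):
    """Split a macro body into (screen number, search string) pairs."""
    steps = []
    for part in body.split(";"):
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"cannot parse step: {part!r}")
        steps.append((int(match.group(1)), match.group(2)))
    return steps


class Screen:
    def __init__(self, lines):
        self.lines = lines
        self.top = -1  # nothing located yet

    def locate(self, text):
        """Move to the next line after the current one that contains text."""
        for i in range(self.top + 1, len(self.lines)):
            if text in self.lines[i]:
                self.top = i
                return True
        return False


def run_macro(body, screens):
    for number, text in parse_macro(body):
        screens[number].locate(text)


screens = {
    1: Screen(["** case 1", "3 4", "** case 2", "5 6"]),
    3: Screen(["CASE 1", "7", "CASE 2", "11"]),
}
NC = "1>L /**/; 3>L /CASE/"
run_macro(NC, screens)   # both screens move to their first case
run_macro(NC, screens)   # both screens move to their second case
print(screens[1].top, screens[3].top)   # 2 2
```

Splitting on semicolons is the simplest choice but would fail for a search string that itself contains a semicolon.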

We are in the process of designing a three screen programmer 
workstation to test alternative strategies. We hope to refine 
successful strategies by using the initial system for our own use 
and to test the system with programmers recruited to perform 
benchmark tasks. In addition to producing a useful system, we 
expect to develop a better understanding of how programmers do 
their work. An additional benefit would be the development of 
simplified strategies for coordinating split screens on single 
display systems - these concepts might be rapidly applied to 
currently available programmer workstations. 


EVALUATION STRATEGY 

Our early experiments will concentrate on comprehension tasks 
which can be administered in a well-controlled manner 
(Shneiderman, 1980). For example, we have observed that in-line 
comments tend to clutter the listing, so that more window 
movement commands are needed to study a program. There are three 
experimental conditions: 

1) Single screen with in-line comments - the control group. 

2) One screen with program text only and one screen with 
comments only. A single window movement command will cause 
both screens to move in synchrony. 

3) Two screens which are linked together to show twice as many 
lines of program text with in-line comments. The screens are 
linked so that they act as simply a doubly long window. 

Subjects will be given a comprehension test: forward trace (for a 
given input, what is the output), backward trace (for a given 
output, what must the input have been), value of variables, counts 
of execution, and other questions. Subject evaluations 
complement the objective test scores. 


As our implementation becomes more powerful we will explore 
program debugging, modification, and composition tasks. 


Acknowledgements: We are grateful to IBM Federal Systems Division 
for support of this project. 
PROGRAMMER TASKS TO BE EVALUATED 
(all require comprehension) 

A - Composition 
B - Testing 
C - De-bugging 
D - Modification 

FOCUS IS ON COORDINATED USE OF INFORMATION FOR EACH GIVEN COGNITIVE TASK 

Summary of Observational Study Results 


TASK            COGNITIVE ACTIVITIES 

COMPOSITION     INTEGRATION             • Data Structure 
                                        • Control Structure 
                                        • Modular Design 

TESTING         CORRELATION             • Expected Output 

DE-BUGGING      VARIANCE FROM PLAN      • Semantics 
                                        • Syntax 

MODIFICATION    REFERENCE + LOCATION    • Semantics 
                                        • Specifications 













EXAMPLE: CONFIGURATION A, PROGRAM COMPOSITION TASK 

[Figure not reproduced] 

Reference to a module causes all screens to move in a linked fashion. 

Screen 2 is the ‘work area’; Screen 1 displays specifications; Screen 3 displays that portion of the 
Structure Chart where the module appears. 

[Figure not reproduced; surviving labels: (User Definable Area), Second Module] 






SOFTWARE MACROS 


SEMANTIC ORGANIZATION: 

One screen will be control target. Action (input) on 
this screen causes correlated changes in other screens. 

In Configuration A: Screen 2 

In Configuration B: Window 3 


POTENTIAL SYNTAX: 

MACRO NC 
1>L/**/ 
3>L/CASE/ 
END MACRO 

(macro definition) 
Introduction 


The "Cleanroom" software development methodology is designed to take the 
gamble out of product releases for both suppliers and receivers of the 
software. The ingredients of this procedure are a life cycle of execut- 
able product increments, representative statistical testing, and a standard 
estimate of the MTTF (Mean Time To Failure) of the product at the time of 
its release. 

In the paper we consider a statistical approach to software product testing 
using randomly selected samples of test cases. A statistical model is 
defined for the certification process which uses the timing data recorded 
during test. A reasonableness argument for this model is provided that 
uses previously published data on software product execution. Also in- 
cluded is a derivation of the certification model estimators and a compar- 
ison of the proposed least squares technique with the more commonly used 
maximum likelihood estimators. 

A Statistical Model of Software Reliability 

If there are errors in a software product, users may experience inter- 
mittent failures as the product is executed. Unlike the possibility of 
intermittent failures in hardware, these intermittent failures in software 
are repeatable; that is, if the software is executed again under identical 
initial conditions, then the failures will occur in exactly the same 
places. The appearance of intermittent failure in software in a given 
instruction seeming to fail one time and not another is due to the com- 
plexity of circumstances in which the instructions are executed rather 
than in underlying physical problems that occur during the execution of 
the instruction. 

In the case of hardware failures, the basis for a statistical model 
appears in the very physical behavior of the hardware. But in the soft- 
ware, we must find another basis for statistical behavior. Fortunately, 
that basis is close at hand — it is in the nature of the usage of the 
software by various users. Any particular user will make use of the 
software from time to time with different initial conditions and differ- 
ent inputs. During any specific use of the software, inputs may be 
entered from time to time and outputs observed from time to time during 
the course of the execution. The only failures detectable in the software 
are either from its aborting or from producing faulty output. But any one 
execution from a fixed initial condition from fixed inputs will behave 
similarly for every user every time they use it. 

We call any such fixed use an "execution" which is distinguishable from 
all other executions by its initial condition and its inputs. Any given 
execution may have one or more failures associated with it, which is 
determined by the software itself as compared to the specification it is 
intended to satisfy. An execution will require a fixed number of 
machine cycles. 

Now, we can imagine an "execution lifetime" for any given user to be the 
sequence of executions the user calls for with the software. Such se- 
quences of executions for each user can be assembled into a collection of 
sequences of executions -- one for each user -- and the statistical proper- 
ties of this collection identified as a stochastic process. That is, we 
consider, for the software product, a statistical pattern of usage for 
the product in terms of its initial conditions and inputs. Any execution 
selected in such a stochastic process will in general depend upon the 
past history of the sequence. For example, it is very unlikely that a 
user will query files before the files are loaded or that a user will 
call for two successive file maintenance executions. These kinds of 
conditions can be represented in a stochastic process which defines prob- 
abilities at any point in time to depend upon the state of the past history 
of the process. 
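
A usage process of this kind can be sketched as a small Markov chain from which test sequences are drawn at random. The sketch below is hypothetical throughout: the execution types, the transition probabilities, and the first-order (previous execution only) dependence are assumptions made for illustration and are not data from the paper. The two conditions mentioned in the text are built in: a lifetime always starts by loading files, and two maintenance executions never occur in succession (the text says only that this is very unlikely).

```python
import random

# Hypothetical usage profile: P(next execution type | previous execution type).
USAGE = {
    "start":    {"load": 1.0},
    "load":     {"query": 0.8, "maintain": 0.2},
    "query":    {"query": 0.7, "maintain": 0.2, "load": 0.1},
    "maintain": {"query": 0.9, "load": 0.1},
}


def sample_lifetime(length, rng):
    """Draw one sequence of executions from the usage profile."""
    state = "start"
    history = []
    for _ in range(length):
        choices = USAGE[state]
        state = rng.choices(list(choices), weights=list(choices.values()))[0]
        history.append(state)
    return history


rng = random.Random(1983)
lifetime = sample_lifetime(20, rng)
print(lifetime)
```

Test cases sampled this way are representative of usage in the sense that frequently used paths are exercised in proportion to how often users take them.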

With a statistical basis of user usage of the product, we can determine 
various statistical measures such as the MTTF, or the variance around the 
MTTF, etc. where time is measured in machine cycles. We are interested 
in failure free execution intervals, rather than trying to estimate the 
errors remaining in a software design. Our objective is to measure 
operational reliability which is the reason for the user usage perspec- 
tive. 


The Effect of Engineering Changes on the MTTF of Software 

Consider a software increment under test and certification in which 
failures are observed and the results returned to the development 
group. On the analysis of these failures, the development group may 
propose engineering changes to correct the software. These engineering 
changes can increase the MTTF of the software, and we wish to account 
for that increase in the MTTF. 

When engineering changes are made to software, it is only prudent to 
undertake regression testing to insure that these changes have not 
created new failures in execution. This regression testing should use 
previously generated statistical tests. It goes without saying that 
this regression testing cannot be considered part of the statistical 
sample used for estimating the reliability of the software. Instead, 
the increased MTTF, if any, must be detected and accounted for by new 
samples independent of the old ones (very likely new samples in later 
increments in which the retested software is only part of the total 
software being tested). 

Suppose at a certain point in time that a set of engineering changes 
$EC_1, EC_2, \ldots, EC_m$ has been applied to the software as a result of 
certification testing and analysis. Suppose that the failure rate of the software 
is $\lambda$ and the failure rate associated with engineering change $EC_i$ is $\lambda_i$. 
Then the failure rate associated with all the engineering changes made 
to date is the sum 

$$\lambda_1 + \lambda_2 + \cdots + \lambda_m$$ 

If the engineering changes have corrected all errors in the software, 
then the foregoing sum will equal $\lambda$; otherwise, it will be less than $\lambda$. 
But, in fact, no one will ever know which case holds, and we assume 
neither case. 

For convenience, we define $\lambda_0$ to be the deficiency, if any, between $\lambda$ 
and the foregoing sum. That is 

$$\lambda_0 = \lambda - (\lambda_1 + \lambda_2 + \cdots + \lambda_m)$$ 

In this case, the quantities 

$$p_0 = \lambda_0/\lambda, \quad p_1 = \lambda_1/\lambda, \quad p_2 = \lambda_2/\lambda, \quad \ldots, \quad p_m = \lambda_m/\lambda$$ 

are probabilities; namely, $p_i$ is the probability that a failure was 
caused by the error corrected by engineering change $EC_i$. ($p_0$ is the 
probability a failure was caused by an error not corrected by any 
engineering change.) 

If we assume an exponential distribution for time to next failure, in 
line with Adams' (8) and Nagel's (9) findings, the MTTF is the reciprocal 
of failure rate. We can then calculate a new MTTF after each successive 
engineering change has been made; namely, beginning with $MTTF_0$, the 
$MTTF_m$ of the original product after m changes will be 

$$MTTF_1 = MTTF_0 / (1 - p_1)$$ 

$$MTTF_2 = MTTF_0 / (1 - p_1 - p_2)$$ 

$$\vdots$$ 

$$MTTF_m = MTTF_0 / (1 - p_1 - \cdots - p_m)$$ 

The $p_i$ values, or correspondingly $\lambda_i/\lambda$, can be expected to be decreasing 
in size, even though we cannot observe them directly with any certainty. 
This is because the errors with the highest associated rates of failure 
will be most likely detected and corrected earliest. That can't be guaranteed, 
of course, because a rare failure may well occur early as well 
and a correction made for it. 

This expected decrease in size can be modeled in a simplified form if 
the $p_i$ are defined by the probability distribution of geometrically 
decreasing terms 

$$p_i = (1 - \alpha)\,\alpha^{i-1}, \qquad 0 < \alpha < 1$$ 

That is, each $p_i$ is a fraction $\alpha$ of the preceding $p_{i-1}$. We can explicitly 
sum the denominator on the right side of the $MTTF_m$ equation to get 
a new formula 

$$MTTF_m = MTTF_0 R^m, \quad \text{where } R = 1/\alpha$$ 

In this formula, R is the fractional improvement of the MTTF for 
each engineering change. In actual practice, R is just an 
average: some engineering changes will affect the MTTF more than others, 
depending upon the rate of failure associated with the error that has 
been fixed. 
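
The closed form can be checked numerically. With $p_i = (1-\alpha)\alpha^{i-1}$ the partial sum $p_1 + \cdots + p_m$ equals $1 - \alpha^m$, so the denominator $1 - p_1 - \cdots - p_m$ collapses to $\alpha^m$ and $MTTF_m = MTTF_0 / \alpha^m = MTTF_0 R^m$. The short Python sketch below compares the term-by-term sum with the closed form; the starting MTTF of 100 and $\alpha = 0.8$ are arbitrary illustrative values, not figures from the paper.

```python
def mttf_after_changes(mttf0, alpha, m):
    """MTTF after m changes, summing the geometric p_i term by term."""
    p = [(1 - alpha) * alpha ** (i - 1) for i in range(1, m + 1)]
    return mttf0 / (1 - sum(p))


mttf0, alpha = 100.0, 0.8   # illustrative values only
R = 1 / alpha

for m in range(0, 6):
    direct = mttf_after_changes(mttf0, alpha, m)
    closed = mttf0 * R ** m
    print(m, round(direct, 3), round(closed, 3))
```

The two columns agree to rounding, which is all the derivation claims: under the geometric assumption each change multiplies the MTTF by the same factor R.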

This particular software reliability model has been independently derived 
by several other people starting with different initial assumptions. It 
is equivalent to the Moranda geometric de-Eutrophication model and the 
Ramamoorthy-Bastani input domain based model. Moreover all of these 
models can be viewed as special cases of the Cox Proportional Hazard model. 


It is well known that engineering changes themselves can introduce more 
errors in a software product. It appears that errors induced by such 
changes are much smaller when the changes are carried out by the original development 
group than with a separate field support group; but, nevertheless, we 
augment the above model with a contribution of error from engineering 
changes themselves. For this purpose, assume that engineering change 
$EC_i$ introduces a failure rate at the level of $\epsilon_i$; then the above 
calculation needs to be modified to alter the definition of the $p_i$ to the 
following 

$$p_i = (\lambda_i - \epsilon_i)/\lambda$$ 

In this case, the remaining calculations go as before with $p_0$ again 
defined to account for the discrepancy but with the same end result, 
namely 

$$MTTF_m = MTTF_0 R^m$$ 

A Reasonableness Check of the Model 

In 1980 Adams analyzed the software failure history of a number of large 
software products. Table 1 is taken from that work and illustrates the 
percent of errors in various failure rate classes. Two striking features 
of this data are the wide range of failure rates and the high percentage 
of very low rate errors. One third of the errors have MTTF of 5000 years. 

Table 1 

FITTED PERCENTAGE DEFECTS 

MEAN TIME TO PROBLEM OCCURRENCE IN KMONTHS BY RATE CLASS 

PRODUCT     60     19      6    1.9     .6    .19    .06   .019 

   1      34.2   28.8   17.8   10.3    5.0    2.1    1.2    0.7 
   2      34.3   28.0   18.2    9.7    4.5    3.2    1.5    0.7 
   3      33.7   28.5   18.0    8.7    6.5    2.8    1.4    0.4 
   4      34.2   28.5   18.7   11.9    4.4    2.0    0.3    0.1 
   5      34.2   28.5   18.4    9.4    4.4    2.9    1.4    0.7 
   6      32.0   28.2   20.1   11.5    5.0    2.1    0.8    0.3 
   7      34.0   28.5   18.5    9.9    4.5    2.7    1.4    0.6 
   8      31.9   27.1   18.4   11.1    6.5    2.7    1.4    1.1 
   9      31.2   27.6   20.4   12.8    5.6    1.9    0.5    0.0 


Table 1 gives a new insight into the power of statistical testing, relative 
to selective testing or inspection, for improving MTTF. Finding 
errors at random is a very different matter than finding execution failures 
at random. One third of the errors found at random will hardly affect 
the MTTF at all; the next quarter of the errors found at random do little 
more. The two highest rate classes, some two percent of the errors, cause 
a thousand times more failures per error than the two lowest rate classes, 
some sixty percent of the errors. That is, statistical testing will uncover 
the high rate errors by a factor of 2000/60, some 30 to 1, while randomly 
finding errors uncovers high rate errors by a fraction of only 1 to 30. 
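
The contrast between counting errors and counting failures can be reproduced from the Product 1 row of Table 1. The Python sketch below weights each rate class by its failure rate, taken here as the reciprocal of the class mean time to problem occurrence, and reports what share of failures each class produces. Treating the rate as exactly 1/MTTF and using a single product row are simplifying assumptions of the sketch.

```python
# Class mean times (kmonths) and Product 1 percentages, copied from Table 1.
MTTF_KMONTHS = [60, 19, 6, 1.9, 0.6, 0.19, 0.06, 0.019]
PRODUCT_1_PCT = [34.2, 28.8, 17.8, 10.3, 5.0, 2.1, 1.2, 0.7]


def failure_shares(pct_errors, mttf):
    """Share of observed failures per class: errors weighted by rate = 1/MTTF."""
    weights = [p / t for p, t in zip(pct_errors, mttf)]
    total = sum(weights)
    return [w / total for w in weights]


shares = failure_shares(PRODUCT_1_PCT, MTTF_KMONTHS)
low_rate = sum(shares[:2])     # the two lowest rate classes (60 and 19 kmonths)
high_rate = sum(shares[-2:])   # the two highest rate classes (.06 and .019 kmonths)

print(f"two lowest rate classes:  {sum(PRODUCT_1_PCT[:2]):.1f}% of errors, "
      f"{100 * low_rate:.1f}% of failures")
print(f"two highest rate classes: {sum(PRODUCT_1_PCT[-2:]):.1f}% of errors, "
      f"{100 * high_rate:.1f}% of failures")
```

For this row the roughly two percent of errors in the highest rate classes account for well over half of the failures, which is why a test sample drawn from usage finds them first.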

The availability of the Adams data provides a unique opportunity for 
checking model reasonableness, since it can provide failure rate as 
a function of engineering change. Most available data is given in terms 
of errors found or inter-fail times but not true failure rate. With the 
Adams data, separate examinations of model assumptions and parameter estimation 
techniques can be performed. Quantile-Quantile and trend plots 
have been previously proposed for comparing the goodness of fit of different 
models but, without failure rate data, were unable to differentiate 
between assumptions and estimation techniques when models performed poorly. 

The reasonableness analysis for the certification model was performed in 
two parts, first assuming perfect debugging of the software and subse- 
quently considering the effect of introducing errors during product repair. 

To perform the analysis, failure rate data exhibiting the effects of engineering 
changes was derived from the Adams data, which was reported in 
terms of failure rate classes and failure counts. First $\lambda_1, \lambda_2, \ldots, \lambda_8$ 
were defined as the failure rates associated with the eight rate classes 
and n(i,k) as the number of errors in rate class i after k engineering 
changes. The initial failure rate for a product (before making any 
engineering changes) can then be expressed as 

$$F_0 = \sum_{i=1}^{8} \lambda_i \, n(i,0).$$ 

To introduce the effect of engineering changes, successive failure rates 
must be derived by simulating the occurrence of failures. This is done by 
expressing the probability of the first error being from rate class i as 

$$\lambda_i \, n(i,0) / F_0$$ 

and using a random number generator to select a rate class (designated $i_0$) 
according to these probabilities. The number of errors in each rate class 
after removing the first error would then be expressed as 

$$n(i,1) = n(i,0) \quad \text{for } i \neq i_0$$ 
$$n(i,1) = n(i,0) - 1 \quad \text{for } i = i_0.$$ 

Since the number of errors for each rate class can be derived, the failure 
rate for the product after removing one error (first engineering change) 
can be expressed as 

$$F_1 = \sum_{i=1}^{8} \lambda_i \, n(i,1).$$ 

This process is repeatable to develop successive failure rates for the 
product and produces a single realization of a failure rate curve based 
on the Adams data. 

For reasonableness analysis an expected failure rate curve obtained by 
averaging a number of realizations is a better tool. Figure 1 illustrates 
such a curve that was created by averaging 100 realizations of the Adams 
data assuming an initial count of 500 errors. The availability of the 
expected failure rate curves permits an examination of the reasonableness 
of the proposed certification model and a comparison of its assumptions 
and estimation techniques against other software reliability models. 
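
The simulation procedure can be sketched in Python as follows. The sketch assumes perfect debugging, takes each class failure rate as the reciprocal of its mean time in Table 1, and uses the Product 1 percentages to split 500 initial errors among the eight classes; the function names, the rounding of class counts, and the choice of 200 engineering changes are assumptions of the sketch, not details from the paper.

```python
import random

# Class mean times (kmonths) and Product 1 percentages, copied from Table 1.
MTTF_KMONTHS = [60, 19, 6, 1.9, 0.6, 0.19, 0.06, 0.019]
PRODUCT_1_PCT = [34.2, 28.8, 17.8, 10.3, 5.0, 2.1, 1.2, 0.7]
RATES = [1.0 / t for t in MTTF_KMONTHS]   # assumed: rate = 1 / mean time


def initial_counts(total_errors, pct):
    """n(i,0): errors per class; rounding means the sum may differ slightly."""
    return [round(total_errors * p / 100.0) for p in pct]


def one_realization(counts, changes, rng):
    """One failure rate curve F_0 .. F_changes (needs changes <= sum(counts))."""
    n = list(counts)
    curve = [sum(r * c for r, c in zip(RATES, n))]
    for _ in range(changes):
        # The failing error comes from class i with probability rate_i * n_i / F.
        weights = [r * c for r, c in zip(RATES, n)]
        i0 = rng.choices(range(len(n)), weights=weights)[0]
        n[i0] -= 1   # perfect debugging: the error is removed, none is added
        curve.append(sum(r * c for r, c in zip(RATES, n)))
    return curve


def expected_curve(counts, changes, realizations, seed=0):
    """Average several realizations point by point."""
    rng = random.Random(seed)
    runs = [one_realization(counts, changes, rng) for _ in range(realizations)]
    return [sum(run[k] for run in runs) / realizations
            for k in range(changes + 1)]


counts = initial_counts(500, PRODUCT_1_PCT)
curve = expected_curve(counts, changes=200, realizations=100)
for k in (0, 10, 50, 100, 200):
    print(k, round(curve[k], 2))
```

Because high rate errors are selected first with high probability, the averaged curve drops steeply over the first few changes and then flattens.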

The curve illustrated in Figure 1 is not of the form $1/(M R^k)$ (the reciprocal 
of MTTF in the certification model) since its logarithm is not linear, as 
shown in Figure 2. However, large segments of the curve are of the desired 
form, which suggests that the model is useful for certification but not 
extended prediction. Since our objective is software certification, the 
model satisfies this role and introducing complexity for long range prediction 
is not warranted. 

[Figure not reproduced; surviving axis label: LOG(FAILURE RATE FROM ADAMS DATA)] 

The failure rate curve shows that failure rate does not decrease linearly 
with engineering changes as assumed by the Jelinski-Moranda and derivative 
models. The Littlewood-Verrall model has good fit with the failure rate 
curve and has been used by one of the authors for long range predictions. 

Failure rate curves that cover the imperfect debugging case require additional 
knowledge of the probability ($\gamma$) that a fix creates an error and the 
probability ($g_i$) that the created error is from rate class i. A reasonable 
assumption is that $\gamma$ should be in the range 0 to .25 and, based on Adams' 
suggestion (14), $g_i$ can be derived by assuming the repair process is similar 
to the development process. The following set of $g_i$ values have been 
experimentally derived: 

Rate Class of Created Error     1     2     3     4     5     6     7     8 

Probability                   .04   .08   .15   .22   .20   .16   .09   .06 

Expected failure rate curves have been generated using the derived $g_i$ data 
over a range of $\gamma$ values. Analysis indicates that the certification 
model is equally useful in the imperfect debugging case, where the major 
distinction is the appearance of fatter tails than in the perfect 
debugging case. 


Parameter Estimation 

To use the proposed model for actual software certification, methods are 
required for estimating the model parameters ($MTTF_0$ and R) from recorded 
testing data. The suggested estimation procedure differs from methods 
used by other reliability models and is based on a least squares technique. 

Let $t_1, t_2, \ldots, t_n$ be the successive interfail times for a product under 
test and certification. From time to time, engineering changes will be 
made to the product in response to observed failures. These changes are 
introduced after the failures are observed, and typically packaged in an 
incremental release from development to test. For each i, let $c_i$ be the 
cumulative number of engineering changes made to the product after the 
interfail interval measured by $t_i$. The $t_i$ introduce a source of randomness 
since the times to failure will vary about the mean. The certification 
model and most other models assume an exponential distribution, 
which seems to be corroborated by the Nagel and Skrivan work. 

Properties of a model's estimators, such as unbiasedness, are as important 
as the underlying model assumptions. Estimators for the $MTTF_0$ and R 
parameters in the certification model have been developed, which are computed 
with a least squares analysis of the logarithms of the interfail 
times and a bias correction. 

Most existing models rely on maximum likelihood estimators (MLE) which have 
a known set of problems as discussed by Forman-Singpurwalla, Sukert, and 
Littlewood-Verrall relative to the Jelinski-Moranda model. It has been 
demonstrated that for a practical number of data points MLE exhibits bias and 
has greater variance than the log least squares estimators. The 
bias cannot be corrected because there is not a closed form MLE solution. 
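
A log least squares fit of this kind can be sketched in Python. Taking logarithms of $MTTF_0 R^{c_i}$ gives a straight line in $c_i$ with intercept $\log MTTF_0$ and slope $\log R$, so ordinary least squares applies. The estimator formulas are not given in this section, so the details below are assumptions: in particular, the bias correction shown uses the fact that for an exponentially distributed time the expected logarithm is the logarithm of the mean minus Euler's constant, and it may differ from the correction the authors used. The function name and the simulated inputs are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def fit_certification_model(interfail_times, changes):
    """Least squares fit of log t_i = log M + c_i log R; returns (M, R)."""
    y = [math.log(t) for t in interfail_times]
    n = len(y)
    c_bar = sum(changes) / n
    y_bar = sum(y) / n
    sxx = sum((c - c_bar) ** 2 for c in changes)
    sxy = sum((c - c_bar) * (v - y_bar) for c, v in zip(changes, y))
    slope = sxy / sxx
    intercept = y_bar - slope * c_bar
    # Assumed correction: E[log t] = log(mean) - gamma for exponential t,
    # so gamma is added back to the intercept before exponentiating.
    return math.exp(intercept + EULER_GAMMA), math.exp(slope)


# Simulated interfail times with one engineering change after each failure.
rng = random.Random(0)
true_m, true_r = 10.0, 1.05
changes = list(range(200))
times = [rng.expovariate(1.0 / (true_m * true_r ** c)) for c in changes]

m_hat, r_hat = fit_certification_model(times, changes)
print(f"estimated MTTF_0 = {m_hat:.2f}, estimated R = {r_hat:.4f}")
```

The slope, and hence R, needs no correction because the constant offset affects only the intercept.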

Figure 3 shows the MTTF curves when the logarithmic technique is used for 
estimating the $MTTF_0$ and R parameters. As calibration points, 50, 100, 
150 and 200 data points were selected to evaluate the method using the 
simulation of the Adams data. Since the intent at this point was to demonstrate 
the effectiveness of the estimators, interfail times simulated from 
a curve of the form $MTTF_0 R^{c_i}$ could have been used. However, using any 
realization, such as illustrated in Figure 3, provides a more interesting 
test and a closer simulation of what will actually occur. As can be seen, 
all curves give a good prediction of MTTF with the most discrepancy in the 
50 point case when prediction is carried too far into the future. 

WHY MTTF? 

Quality Should Be Measured From Customer's Perspective 

How often does it fail? 

MTTF reports by large commercial customers 
MTTF ship criteria 

What's the severity of a fail? 


WHY MTTF? 

Management Decisions 

High MTTF is good, low MTTF is bad. 

If x errors have been fixed, is that good or bad? 

If x is small, either: 

1) There were very few errors made 
or 
2) There are plenty of errors, but the 
testing process is ineffective 

If x is large, either: 

1) There were a large number of errors made 
or 
2) The testing process is very effective 


WHY MTTF? 

You Get What You Measure. 

Why Not Measure What The Customer Wants? 


MODELING GOALS 

Certify MTTF 

Avoid Restrictive Assumptions 

Provide Statistically Sound Estimators 

Keep It Conceptually Simple 

Each lifetime has a probability of occurrence. 

Failure rate F of the programs is averaged over all 
possible lifetimes. 

$$MTTF = M = \frac{1}{F}$$ 

Each error e has an associated failure rate $F_e$. 

$$F = \sum_e F_e$$ 

Let $p_e = F_e / F$. (Relative frequency of error e) 


Upon removal of error e: 

$$F \rightarrow F - F_e$$ 

$$M \rightarrow \frac{1}{F - F_e} = \frac{1}{F(1 - p_e)} = \frac{M}{1 - p_e}$$ 

In general, upon removal of errors $e_1, e_2, \ldots, e_n$: 

$$M \rightarrow \frac{M}{1 - p_1 - p_2 - \cdots - p_n}$$ 


Then, after k engineering changes, 

$$\frac{M}{1 - \sum_{i=1}^{k} p_i} = \frac{M}{1 - \sum_{i=1}^{k} (1-\alpha)\,\alpha^{i-1}} = \frac{M}{1 - (1-\alpha)\,\frac{1-\alpha^k}{1-\alpha}} = \frac{M}{\alpha^k} = M R^k$$ 

where $R = 1/\alpha$. 
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SOFTWARE CERTIFICATION MODEL 


Mean Time To Failure After k Engineering Changes 

$\mathrm{MTTF}_k = M R^{k}$ 

where 


M = Mean time to failure before any 
engineering changes 

R = Factor for relative improvement in MTTF 
due to a single engineering change 
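
The certification model can be sketched in a few lines of Python. The sketch assumes the geometric form in which the i-th engineering change removes relative frequency (1 - a) a^(i-1), so that R = 1/a; the function names and the sample values of M, a and k are hypothetical.

```python
def mttf_after_changes(m, r, k):
    """Certification model: MTTF after k engineering changes is M * R**k."""
    return m * r ** k


def mttf_from_frequencies(m, a, k):
    """Same quantity computed from the summed relative frequencies p_i = (1 - a) * a**(i - 1)."""
    removed = sum((1 - a) * a ** (i - 1) for i in range(1, k + 1))
    return m / (1 - removed)


# Hypothetical values: M = 2 hours, a = 0.8 (so R = 1.25), k = 5 changes.
m, a, k = 2.0, 0.8, 5
print(mttf_after_changes(m, 1 / a, k))
print(mttf_from_frequencies(m, a, k))
```

Both calls print the same value up to rounding, which is a quick check that the closed form and the summed form agree.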
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SOFTWARE CERTIFICATION MODEL ASSUMPTION 

Equivalent to 

Ramamoorthy-Bastani 

Application: Nuclear Reactors 

Moranda Geometric De-Eutrophication 


Special case of 

Cox Proportional Hazard Rate 

Application: Boeing Computer Services 
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I MODEL REASONABLENESS 

Adams data 

    Large software products 
    Product usage 
    Defects found 
    Failures due to a defect 
    MTTF classification of defects 

Allows independent check of 

    Model assumptions 
    Estimation procedures 

P. Currit 
IBM 


Fitted Percentage Defects 
Mean Time to Problem Occurrence in Years 

Product     1.6      5     16     50    160    500   1600   5000 

   1        0.7    1.2    2.1    5.0   10.3   17.8   28.8   34.2 
   2        0.7    1.5    3.2    4.5    9.7   18.2   28.0   34.3 
   3        0.4    1.4    2.8    6.5    8.7   18.0   28.5   33.7 
   4        0.1    0.3    2.0    4.4   11.9   18.7   23.5   34.2 
   5        0.7    1.4    2.9    4.4    9.4   18.4   28.5   34.2 
   6        0.3    0.8    2.1    5.0   11.5   20.1   28.2   32.0 
   7        0.6    1.4    2.7    4.5    9.9   18.5   28.5   34.0 
   8        1.1    1.4    2.7    6.5   11.1   18.4   27.1   31.9 
   9        0.0    0.5    1.9    5.6   12.8   20.4   27.6   31.2 

E. N. Adams, RC 8228, 4/11/80, p. 
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SIMULATED FAILURE RATE CURVES 
Based on Adams Data 

For Perfect Debugging 

Let $n(i,k)$ = number of errors of failure 
rate $\lambda_i$ after $k$ fixes 

Let $F_k$ = program failure rate after $k$ fixes 

$F_0 = \sum_{i=1}^{8} \lambda_i \, n(i,0)$ 

Probability that first failure is caused by an error of 
rate $\lambda_i$ 

$= \lambda_i \, n(i,0) / F_0$ 

Randomly select $i_0$ according to preceding probability 

$n(i,1) = n(i,0)$ for $i \neq i_0$ 

$n(i,1) = n(i,0) - 1$ for $i = i_0$ 

$F_1 = \sum_{i=1}^{8} \lambda_i \, n(i,1)$ 

Repeat to determine $F_2, F_3, \ldots$ 
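
The simulation procedure above can be sketched in Python as follows. The eight rate classes are taken as the reciprocals of the MTTF classes in the Adams table; the initial error counts n(i,0), the number of fixes and the random seed are hypothetical values chosen only to make the sketch runnable.

```python
import random

# MTTF classes in years from the Adams table; the rate of class i is 1 / MTTF_i.
MTTF_YEARS = [1.6, 5, 16, 50, 160, 500, 1600, 5000]
RATES = [1.0 / years for years in MTTF_YEARS]


def simulate_failure_rates(counts, fixes, rng):
    """Return [F_0, F_1, ...] under perfect debugging.

    counts[i] is n(i, 0). At each step the failing error class is drawn with
    probability rate_i * n(i, k) / F_k and one error of that class is removed.
    """
    n = list(counts)
    curve = [sum(r * c for r, c in zip(RATES, n))]
    for _ in range(fixes):
        weights = [r * c for r, c in zip(RATES, n)]
        if sum(weights) == 0:
            break  # no errors left to fail
        i0 = rng.choices(range(len(n)), weights=weights)[0]
        n[i0] -= 1
        curve.append(sum(r * c for r, c in zip(RATES, n)))
    return curve


# Hypothetical initial counts n(i, 0), one per rate class.
initial_counts = [2, 3, 5, 10, 20, 40, 60, 70]
curve = simulate_failure_rates(initial_counts, fixes=20, rng=random.Random(0))
print([round(f, 4) for f in curve[:5]])
```

Because high-rate errors are the most likely to cause the next failure, they tend to be removed first, so the simulated failure rate drops quickly at the start and then flattens.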
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PARAMETER ESTIMATION 
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Comparison with Maximum Likelihood 
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MISRA DATA USING CLEAN ROOM MODEL 


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Implies 15 expected errors in STS4 
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Estimators: Statistical Properties 

• Unbiased 

• Decreasing variance 

• Relative efficiency 
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Abstract 


In these days of soaring software costs it becomes increasingly important 
to properly manage a software development project. One element of the 
management task is the projection and tracking of manpower required to 
perform the task. In addition, since the total cost of the task is 
directly related to the initial quality built into the software, it 
becomes a necessity to project the development manpower in a way to 
attain that quality. The purpose of this paper, then, is to describe an 
approach to projecting and tracking manpower with quality in mind. 

The basic approach is to begin with a current manpower model which 
accurately describes the cost of developing a usable element of software. 
Then, based on the assumption that improving quality does not cost more 
over the entire life cycle, the current model is modified to reflect 
greater expenditure on elements of work which are known to improve 
initial quality. This requires a reduction in the cost of other elements 
since an increase in quality does not cost more. The obvious elements 
to reduce are those directly affected by quality. The final result of 
this type of analysis is the development of a manpower model designed to 
generate quality software. 

The resulting model is useful as a projection tool but must be validated 
in order to be used as an on-going software cost engineering tool. A 
procedure is developed to facilitate the tracking of model projections 
and actual data to allow the model to be tuned. Finally, since the 
model must be used in an environment of overlapping development activities 
on a progression of software elements in development and maintenance, a 
manpower allocation model is developed for use in a steady state development/ 
maintenance environment. 
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Introduction 


In these days of soaring software costs it becomes increasingly important 
to properly manage a software development project. One element of the 
management task is the projection and tracking of manpower required to 
perform the task. In addition, since the total cost of the task is 
directly related to the initial quality built into the software, it 
becomes a necessity to project the development manpower in a way to 
attain that quality. The purpose of this paper, then, is to describe an 
approach to projecting and tracking manpower with quality in mind. 

The basic approach is to begin with a current manpower model which 
accurately describes the cost of developing a usable element of software. 
Then, based on the assumption that improving quality does not cost more 
over the entire life cycle, the current model is modified to reflect 
greater expenditure on elements of work which are known to improve 
initial quality. This requires a reduction in the cost of other elements 
since an increase in quality does not cost more. The obvious elements 
to reduce are those directly affected by quality. The final result of 
this type of analysis is the development of a manpower model designed to 
generate quality software. 

The resulting model is useful as a projection tool but must be validated 
in order to be used as an on-going software cost engineering tool. A 
procedure is developed to facilitate the tracking of model projections 
and actual data to allow the model to be tuned. Finally, since the 
model must be used in an environment of overlapping development activities 
on a progression of software elements in development and maintenance, a 
manpower allocation model is developed for use in a steady state development/ 
maintenance environment. 


The Cost of Ignoring Initial Quality 


In the past, software projects have generated initial software relying on 
the usual network of functional, subsystem and system tests to find the 
"bugs" prior to system delivery. This is a questionable approach, 
however, when the overall cost of the finished (debugged) system is 
considered. As Figure 1 shows, software development is a pyramiding or 
stair-stepping group of functions, each of which, when begun, continues 
until the project is complete. Errors found early in development, when 
only the programmer is involved, are essentially "free". That is, they 
can be absorbed in the normal work flow at a minimum cost. Once the 
code is placed on the master system, however, an error must be 
documented by a discrepancy report (DR) which must eventually be closed 
by all elements of the project. And so it goes: the later in the life 
cycle that a software error is discovered, the more elements of the 
project are involved in the software and the more work must be done to 
correct the error. This naturally costs more. The result is that shown 
generically in Figure 1 and can be summarized as: The later in the 
software development cycle that an error is found, the more it costs. 

The obvious conclusion is that steps should be taken to find errors 
early in the development process to minimize cost. The first step in 
this process is to define the positive actions required and to plan the 
life cycle and project appropriate manpower to accomplish those actions. 
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The Current Manpower Model 


As mentioned in the introduction, the basic approach requires the use or 
generation of a manpower model which reflects the current software 
development environment. The complete development of the model used is 
described in Reference 1; however, to aid in this discussion, a brief 
summary is included here. The development of the model was initiated by 
delineating all project costs with the following information: 

o Type cost: direct change request (CR) cost, technical and 
  project support costs. 
o Organization: software development project organization. 
o Function: purpose of the cost. 
o Drivers: factors affecting the cost. 
o Estimation methodology: how the item is estimated. 


These project costs were placed into categories and then reordered by 
those categories. The categories used were as follows: 


o Category I:    direct CR cost 
o Category II:   development/verification technical support 
o Category III:  preprocessors 
o Category IV:   management and common support 
o Category V:    project release/schedule/reconfiguration 
o Category VI:   maintenance 
o Category VII:  project independent costs 


Using the first five categories (ignoring maintenance and project independent 
costs for the moment) and examining Release 19 of the Shuttle onboard 
Primary Avionics Software System (PASS) we can express the cost of that 
release with the percent model shown in Figure 2. 
CATEGORY  AREA          FUNCTION                          R19 Manmonths     % 

I         DEV.          Direct CR Est.                              197    16 
          VERIF.        Direct CR Est.                              173    14 
II        DEV.          Requirements Analysis (R.A.)                 13     1 
                        Level 3 Test (L3)                            26     2 
                        Systems Analysis (SA)                         9     1 
                        Systems Architecture (SAr)                   27     2 
II        VERIF.        Studies and Audits (ST/AU)                   19     2 
                        Common Function Tests (CF)                    8     1 
                        Systems Measurement (S Meas)                 14     1 
                        Level 7 Test (L7)                           139    11 
                        Level 6 DR Support                           38     3 
III       DEV.          Preprocessors (PREP)                         30     3 
IV        DEV, VERIF,   Management and Support (M&S)                135    11 
          P.O.          Common Support (CS)                          70     6 
V         SFO           Build and Integration (B&I)                 140    11 
                        Resource Management (RES MGT.)              100     8 
                        Configuration Management/ 
                        Data Management (CM/DM)                      80     7 

TOTALS                                                             1218   100 


Figure 2. Percent Model for PASS Release 19 
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By examining these costs by category, it can be seen that a factor can be 
developed which will relate the total cost through Category IV to the direct 
costs contained in Category I. This is accomplished by the following 
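calculations: 

DIRECT CR COSTS = $CR_D$ = CATEGORY I 

INDIRECT CR COSTS = $CR_I$ = CATEGORIES II - IV 

$\mathrm{FACTOR} = \frac{CR_D + CR_I}{CR_D} = \frac{\mathrm{I} + \mathrm{II} + \mathrm{III} + \mathrm{IV}}{\mathrm{I}}$ 

$\mathrm{FACTOR} = \frac{370 + 293 + 30 + 205}{370} \approx 2.5$ 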


This factor can be used along with estimates of the direct CR costs to cal- 
culate those costs driven by CR's. However, maintenance (Category VI) costs 
are also driven by CR costs whereas Categories V and VII are not. 
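
As a minimal sketch, the factor calculation can be reproduced from the Release 19 category totals in Figure 2. The dictionary layout and function name are assumptions made for illustration. Note that the exact ratio is about 2.43; the paper carries 2.5 forward.

```python
# Release 19 manmonths by cost category, summed from Figure 2.
current_model = {"I": 370, "II": 293, "III": 30, "IV": 205}


def cr_factor(category_manmonths):
    """Ratio of all CR-driven cost (Categories I-IV) to direct CR cost (Category I)."""
    direct = category_manmonths["I"]
    indirect = sum(category_manmonths[c] for c in ("II", "III", "IV"))
    return (direct + indirect) / direct


print(round(cr_factor(current_model), 2))

# Usage: mark up a hypothetical direct CR estimate of 40 manmonths
# with the rounded factor used in the text.
print(2.5 * 40)
```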

The cost to maintain a CR is given by the area of the difference between a 
Rayleigh curve without the CR and one which includes it evaluated over the 
maintenance timeframe. (Figure 3) 
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This is calculated by taking the integral of the difference between the 
two generalized equations of the curves and letting the time of the maximum 
be one year consistent with current release plans. Doing this a formula for 
maintenance is generated: 

Maintenance = $e^{-a}(K_1 - K_2)$ 

But the time of the maximum is one year which implies that $a = 1/2$. Thus: 

Maintenance = $e^{-1/2}(K_1 - K_2) = .6\,(K_1 - K_2)$ 

This means that maintenance costs are 60% of the total cost of developing 
a CR. However, the maintenance timeframe does not begin for all project areas 
at the time of the maximum. If the time of the maximum plus .3 years is 
used for the beginning of the maintenance timeframe then the following 
equations are derived: 

Maintenance = .56 $(K_1 - K_2)$ 

Development = .44 $(K_1 - K_2)$ 

The factor necessary to add maintenance costs to the development cost is 
given by: 

Maintenance factor = (.56 + .44) / .44 = 2.25 
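
The 60% figure can be checked with a short Python sketch. It assumes the common Norden-Rayleigh form, in which cumulative effort is K(1 - exp(-a t^2)) and the manpower peak falls at t = 1/sqrt(2a); the paper does not spell out the curve, so this form and the function names are assumptions.

```python
import math


def shape_from_peak(t_peak):
    """Shape parameter a for a Rayleigh curve whose manpower peaks at t_peak."""
    return 1.0 / (2.0 * t_peak ** 2)


def remaining_fraction(t, a):
    """Fraction of total effort K still to be spent after time t: exp(-a * t**2)."""
    return math.exp(-a * t * t)


a = shape_from_peak(1.0)                      # peak at one year gives a = 0.5
print(round(remaining_fraction(1.0, a), 3))   # share of effort after the peak

# Markup from development cost to development plus maintenance, using the
# shares quoted in the text (.56 maintenance, .44 development).
print(round((0.56 + 0.44) / 0.44, 2))
```

The first value is e^(-1/2), or roughly 0.6, matching the text. The second is close to the 2.25 the paper uses.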

Thus, we have developed a useable manpower model that can be expressed in 
terms of categories of cost and associated manpower, a percent model based 
on the categories and a generalized cost model shown in Figure 4 which uses 
factors to arrive at total costs driven by direct CR costs. 
ACTIVITY                      COST CATEGORY                     ALGORITHM 
                              I   II  III  IV   V   VI  VII 

DEVELOPMENT THROUGH           X   X   X    X                    2.5 (CR_D) 
INITIAL SYSTEM 

DEVELOPMENT AND               X   X   X    X        X           2.25 (2.5 (CR_D)) = 5.6 (CR_D) 
MAINTENANCE 

TOTAL PROJECT                 X   X   X    X    X   X   X       5.6 (CR_D) + CAT. V + CAT. VII 
COST 


Figure 4. Generalized Cost Model 
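
The three algorithms in Figure 4 chain together as sketched below. The function names are hypothetical, and the sample inputs in the usage lines are illustrative rather than project data.

```python
DEV_FACTOR = 2.5      # Categories I-IV relative to direct CR cost
MAINT_FACTOR = 2.25   # adds Category VI (maintenance) to development


def development_cost(cr_direct):
    """Development through initial system: 2.5 (CR_D)."""
    return DEV_FACTOR * cr_direct


def development_and_maintenance_cost(cr_direct):
    """Development and maintenance: 2.25 (2.5 (CR_D)), about 5.6 (CR_D)."""
    return MAINT_FACTOR * development_cost(cr_direct)


def total_project_cost(cr_direct, cat_v, cat_vii):
    """Total project cost: CR-driven cost plus Categories V and VII."""
    return development_and_maintenance_cost(cr_direct) + cat_v + cat_vii


# Hypothetical inputs, all in manmonths.
print(development_cost(100))
print(development_and_maintenance_cost(100))
print(total_project_cost(100, cat_v=80, cat_vii=20))
```

Keeping Categories V and VII as separate additive terms reflects the point made in the text that those costs are not driven by CR's.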
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Development of a Manpower Model Based on Quality 


As with any modeling exercise, this one is initiated by collecting data. 
All data is collected by releases of Shuttle onboard PASS. Since the data 
for Release 19 is incomplete, only the data for Releases 16 and 17/18 is used. 

Data Collection 


All the data gathered is from the Project Development Plan and data 
bases which support the plan or from Project Office history files. 
Requirements Change Request (CR) data is collected as the total 
number by release. Discrepancy Report (DR) data reflects the total 
number by release divided into those which require a code fix and 
those which do not. It is important to note that the "No Fix" 
category includes user notes, waivers, and other categories which have 
the potential of becoming "Fix" DR's in the future. The largest 
group in the "No Fix" category, however, is the DR's which are simply 
not PASS problems but simulator, user or misinterpretation errors. 

The manpower data is divided into base work prior to system delivery 
and maintenance work after delivery. Each of these categories is 
subdivided into work performed by the development and verification 
groups. The data collected is then used to generate the data 
comparison table presented as Figure 5. The first four columns of 
data in the table represent the data collected from the project. 
The remaining five columns show relationships derived from the 
ratios of the data elements. 
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DR Analysis 


The next step in the process is to analyze the DR and maintenance 
data to generate the Average Cost of "Fix" and "No Fix" DR's. 
Beginning with Release 16 data the initial action required is to 
remove the technical support manpower from the maintenance manpower 
by dividing by 2.25 (the technical support factor less the project 
office). Then knowing that, on the average, five times as much 
effort is spent on "Fix" DR's as "No Fix" DR's, the following 
equation can be written: 

1435/2.25 = 2371 x + 2289 (x/5) 

The solution of the equation yields the result that each "Fix" DR 
costs 4.50 mandays total, or 2.25 mandays for each of development and 
verification. Performing the same analysis for Release 17/18, 
using development maintenance only since verification maintenance 
was not required, the equation yields 2.7 mandays of development 
effort for each DR. Averaging these figures, the following direct 
impact values are derived: 

o The direct impact of a DR which is fixed is: 

2.5 md FOR DEVELOPMENT 
2.5 md FOR VERIFICATION 

o The direct impact of a DR which is not fixed is: 

.5 md FOR DEVELOPMENT 
.5 md FOR VERIFICATION 
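
The Release 16 equation can be solved directly, as in the sketch below. The sketch assumes that the 1435 figure is in manmonths and that one manmonth is 20 mandays; the paper states neither, but that conversion reproduces the quoted 4.50 mandays. The function name is hypothetical.

```python
MANDAYS_PER_MANMONTH = 20  # assumption; not stated in the paper


def fix_dr_cost(maint_manmonths, support_factor, n_fix, n_no_fix, effort_ratio=5):
    """Solve maint / support_factor = n_fix * x + n_no_fix * (x / effort_ratio) for x."""
    direct = maint_manmonths / support_factor
    return direct / (n_fix + n_no_fix / effort_ratio)


x = fix_dr_cost(1435, 2.25, n_fix=2371, n_no_fix=2289)
print(round(x * MANDAYS_PER_MANMONTH, 2))       # mandays per "Fix" DR
print(round(x * MANDAYS_PER_MANMONTH / 5, 2))   # mandays per "No Fix" DR
```

Under these assumptions a "Fix" DR comes out at about 4.5 mandays and a "No Fix" DR at about 0.9 mandays, each split evenly between development and verification.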
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DR Prevention 



In the Shuttle onboard PASS project DR's are written for a problem only 
after the software causing the problem has been placed on the Master 
System. Once a DR is written, all areas of the project become involved 
in its closure regardless of whether it is a problem or not. Hence, 
there are two possibilities for reducing the number of DR's. The first 
is to enhance the requirements analysis activities to give a reliable 
point of coordination before the DR is written. This subject will not be 
treated further in this study but will be the object of a later study. 

The second possibility, and the main object of this study, is to enhance 
the development process prior to the master system build. The two 
ways to accomplish this are to enhance requirements analysis activities 
and design and code reviews early in the initial development cycle. 
Requirements analysis should be enhanced to improve the quality of CR's 
before implementation begins, shepherd CR's through the development 
life cycle, help specify level 1 and 2 tests, review level 1 and 2 
test results and support design and code reviews. Design and code 
reviews could be improved by allowing more time for the reviews, improving 
checklists and review documentation, providing improved and dedicated 
review moderators and requiring wider involvement from functional 
areas of the project. 
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Modifying the Current Manpower Model 


The approach, then, to modifying the current manpower model is to consider 
which areas of the model should be increased for DR prevention, which 
areas of cost will benefit from having fewer DR's to deal with and which 
areas contain the skills required to enhance early development. These 
areas should be modified accordingly to create an incremental release 
model which assumes an enhanced early development and fewer DR's; in 
other words, a model which assumes and also assures quality. Figure 6 
depicts the process of modifying the current manpower model, which is 
reflected under the "Old %" column. The modifications are listed under 
the "Δ %" column. It should be noted that 3% is taken from each of 
Level 6 and 7 verification and redirected toward the early development 
activity. This results in no change to the overall project development 
model. This is consistent with the introductory assumption that quality 
does not cost more. The final two columns show the current and quality 
manpower models in terms of Release 19 manmonths. 
CAT.  AREA    FUNCTION      OLD %   Δ %           NEW %   EXAMPLE   NEW EXAMPLE 

I     DEV.    DIRECT EST.     16    .4(16) = +6     22       197        274 
      VERIF.  DIRECT EST.     14                    14       173        173 
II    DEV.    RA               1                     1        13         13 
              L3               2                     2        26         26 
              SA               1                     1         9          9 
              SAr              2                     2        27         27 
              DR               -                     -         0          0 
II    VERIF.  RA               -                               0          0 
              ST/AU            2                     2        19         19 
              CF               1                     1         8          8 
              S MEAS           1                     1        14         14 
              L7              11    -3               8       139        100 
              PRE CI DR's      3    -3               -        38          0 
III   DEV.    PREP             3                     3        30         30 
IV    All     M & S           11                    11       135        135 
              CS               6                     6        70         70 
V     SFO     B & I           11                    11       140        140 
              RES MGT          8                     8       100        100 
              CM/DM            7                     7        80         80 

                             100     0             100      1218       1218 


Figure 6. Modifying the Current Manpower Model 
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The amount of manpower relocated in the model change is not arbitrarily 
selected. The direct development manpower is increased by 40%. 

Twenty percent is added to account for increased requirements 
analysis. This figure is based on early Release 16 history when a 
separate requirements analysis group was maintained in the development 
organization. This reflects a return to heavy emphasis on requirements 
analysis as a front end process of the project. The remaining 20% 
is added to the direct development manpower to account for enhanced 
design and code reviews. This figure is based on a comparison of 
the old and new review processes in terms of increased elapsed time 
of the reviews, broader involvement of the project in the reviews 
and increased documentation and tracking. 

To account for the 40% increase in development, a corresponding decrease 
must occur elsewhere since quality does not cost more. The 
two areas selected to sustain the reduction are Level 6 DR support 
and Level 7 Test. Each of these reductions is examined individually. 
Savings Due to Fewer DR's 


Better initial quality should be reflected in the project as fewer 
DR's during the development and verification process. The task 
then is to quantify the projected savings. To do this the following 
procedure is used. By examining Release 19 data it is noted that 
there are 235 CR's included in the release. Based on the Release 
16 and 17/18 DR/CR ratios it can be projected that there will be 
705 DR's during the completion of the development life cycle. Of 
these only 40% or 282 should be code changes. If we increase the 
development budget to improve the initial quality of the software 
going to each build, a decrease in DR's after the builds should be 
expected. It should also be expected that not all DR's will be 
eliminated. Selecting 50% of DR's as a target for elimination, a 
projected savings can be calculated as: 

(282 DR's) (.5) (2.5 md/DR) = 18 man months 

Including technical support (without the project office) a savings 
of 41 man months can be projected in the verification area. This 
amount alone justifies the reduction to the level 6 test function. 
However, the development area will experience a similar 41 man 
month decrease during the verification support time frame. This 
means that our model is conservative. The goal is to reduce the DR 
number by 50% but a 25% reduction will enable the development and 
verification areas to "break even". 
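
The savings projection can be retraced with the sketch below. It assumes 20 mandays per manmonth and reuses the 2.25 technical support factor from the DR analysis; the DR/CR ratio of 3.0 is the value implied by 705 DR's for 235 CR's. The function name is hypothetical.

```python
MANDAYS_PER_MANMONTH = 20  # assumption; the paper does not state the conversion


def projected_dr_savings(crs, dr_per_cr, fix_share, target_reduction,
                         md_per_fix_dr, support_factor):
    """Return (direct, with_support) savings in manmonths for one project area."""
    expected_drs = crs * dr_per_cr
    fix_drs = expected_drs * fix_share
    direct_md = fix_drs * target_reduction * md_per_fix_dr
    direct_mm = direct_md / MANDAYS_PER_MANMONTH
    return direct_mm, direct_mm * support_factor


direct, with_support = projected_dr_savings(
    crs=235, dr_per_cr=3.0, fix_share=0.40, target_reduction=0.50,
    md_per_fix_dr=2.5, support_factor=2.25)
print(round(direct, 1), round(with_support, 1))
```

The results land near the 18 and 41 man months quoted in the text; the small differences come from rounding the direct figure before applying the support factor. Halving `target_reduction` to 0.25 halves both savings, which is the "break even" case when the development and verification areas are counted together.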
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Savings Due to Rephasing Skills 


The skills necessary to enhance the requirements analysis task currently 
reside in the level 7 test group in the verification organization. 
Rephasing these skills to the requirements analysis task must there- 
fore be justified. In the current model the level 7 task took 11% 
of the release manpower. Examination of Release 17/18 data, however, 
shows that of the 674 DR's found by verification, only 35 were found 
by the level 7 test group. Of these, only 12 were significant flight 
software problems. It appears obvious then that the recommended re- 
phasing of skills would be not only feasible but desirable. Three 
percent of the release manpower would be adequate to perform the 
requirements analysis task which would be reflected in fewer DR's 
reaching the master build. This would leave 8% to perform level 7 
testing which could be performed in a more stable environment due to 
fewer DR's during the performance time frame. 

The completion of this exercise concludes the justification of the 
manpower movements to create the quality model. The % model given 
in Figure 6 can then be used to project manpower for Release 20 
and beyond. 
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A Generalized Quality Model 


A % model alone does not completely satisfy the project need. Consequently, 
the procedure used earlier in the paper to generate a generalized model 
is used again with the quality model. By examining the costs 
through category IV, the following calculations can be made: 

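DIRECT CR COSTS = $CR_D$ = CATEGORY I 

INDIRECT CR COSTS = $CR_I$ = CATEGORIES II - IV 

$\mathrm{FACTOR} = \frac{CR_D + CR_I}{CR_D} = \frac{\mathrm{I} + \mathrm{II} + \mathrm{III} + \mathrm{IV}}{\mathrm{I}}$ 

$\mathrm{FACTOR} = \frac{447 + 216 + 30 + 205}{447} = 2.0$ 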


Again this factor can be used to calculate the development and verifi- 
cation costs directly driven by CR's. 

Once again maintenance, which is not included in this factor must be 
considered. By using generalized equations for Rayleigh curves with 
and without a CR the following equations are derived: 

DEVELOPMENT = .4 $(K_1 - K_2)$ 

MAINTENANCE = .6 $(K_1 - K_2)$ 
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But part of the maintenance time frame in the development area is 
now devoted to verification support and maintenance begins at the 
completion of the verification cycle. By separating the verification 
support from the maintenance as shown in Figure 7 the equations are 
modified as follows: 


CAPABILITIES DEVELOPMENT = .4 $(K_1 - K_2)$ 

VERIFICATION SUPPORT = .1 $(K_1 - K_2)$ 

MAINTENANCE DEVELOPMENT = .5 $(K_1 - K_2)$ 

But the verification support contains some verification costs. Correcting 
for this, the equations become: 

CAPABILITIES DEVELOPMENT = .43 $(K_1 - K_2)$ 

VERIFICATION SUPPORT = .07 $(K_1 - K_2)$ 

MAINTENANCE DEVELOPMENT = .5 $(K_1 - K_2)$ 

Thus the factors necessary to add verification support and maintenance 
costs to the costs directly driven by CR's are: 

VERIFICATION SUPPORT = $\frac{.43 + .07}{.43} = 1.15$ 

MAINTENANCE = $\frac{.5 + .5}{.5} = 2.00$ 




Hence, the generalized model presented in Figure 8 is complete. 
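
The two markup factors follow the same pattern and can be sketched as below. The chained cost function is an assumption: it supposes the quality-model factors multiply in the same way as the factors in Figure 4. All names are hypothetical.

```python
def markup_factor(base_share, added_share):
    """Factor that scales a base cost up to base plus added cost."""
    return (base_share + added_share) / base_share


verification_support = markup_factor(0.43, 0.07)   # about 1.16; the text quotes 1.15
maintenance = markup_factor(0.5, 0.5)               # 2.0
print(round(verification_support, 2), maintenance)


def cr_driven_cost(cr_direct, quality_factor=2.0, vs_factor=1.15, maint_factor=2.0):
    """CR-driven cost through maintenance for the quality model.

    Assumption: the three factors multiply, by analogy with Figure 4.
    """
    return maint_factor * vs_factor * quality_factor * cr_direct


# Hypothetical direct CR estimate of 100 manmonths.
print(round(cr_driven_cost(100), 1))
```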
Figure 7. Division of Life Cycle Manpower 
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Finally, it should be noted that if better initial quality is intro- 
duced into the software system then the cost of maintenance should go 
down. Again, the quality model is conservative since it did not take 
this into account. As actuals are accrued, then the model can be 
tuned to reflect those actuals. 
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Extension to a Manpower Allocation Model 


There comes a time in the life of most projects when they can no longer be 
viewed as one Rayleigh curve but as one curve followed by a series of 
smaller curves, each of which represents a release. A generic graph of this 
time frame is reflected in Figure 9. As a steady state, uniform set of 
releases is reached, the total manpower line tends to a steady state and 
the maintenance line also tends to a steady state. Studies using groups 
of curves in this fashion have shown that the steady state maintenance level 
reached is 25% of the total steady state level. Based on the study 
conclusion, the first allocation of manpower should be: 

    DEVELOPMENT    75% 
    MAINTENANCE    25% 

To allocate below this major division, a % model based on organization 
rather than category is required. Performing this reorganization, the 
quality % model by organization is given in Figure 10. 





ORG.    CAT.   FUNCTION               NEW %     NEW EXAMPLE 

DEV.    I      DEV. DIRECT EST.       22 (3)            274 
DEV.    I      VERIF. DIRECT EST.     14                173 
DEV.    II     VERIF. ST/AU            2                 19 
DEV.    II     VERIF. CF               1                  8 
DEV.    III    PREP                    3                 30 
DEV.    IV     M & S (PART)            8                 98 
               Subtotal               50                602 

SE      II     RA                      1                 13 
SE      II     L3                      2                 26 
SE      II     SA                      1                  9 
SE      II     SAr                     2                 27 
SE      II     S MEAS                  1                 14 
SE      II     L7                      8                100 
SE      IV     M & S (PART)            3                 37 
               Subtotal               18                226 

P.O.    IV     CS                      6                 70 
               Subtotal                6                 70 

OTHER   V      B & I                  11                140 
OTHER   V      RES MGMT                8                100 
OTHER   V      CM/DM                   7                 80 
               Subtotal               26                320 

TOTAL                                100               1218 


Figure 10. Quality % Model by Organization 
Since in an incremental release time frame all activities of the project 
are progressing simultaneously, a time slice allocation can be made to 
each activity. The allocation to development activities is shown in 
Figure 11. The legend of this figure is the same as the % model with the 
following exceptions. Software development is divided into capabilities 
development (CD), which is ongoing development of the master system, and 
verification support (VS), which handles error correction during the 
verification time frame. Half of the verification support is set aside as a 
buffer to handle late, mandatory CR's required by near term flights. 
This allows the capabilities development to be scheduled and worked without 
interruption. It should be noted that a comparable amount of buffer 
must be set aside for verification. 


Combining maintenance and development into one allocation model yields the 
model depicted in Figure 12. 
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Buffer Management 


Since the idea of CR buffers in the verification support and verification 
areas has been introduced, some attention must be paid to the management 
of those buffers. From Figure 10 it can be seen that the factor which 
represents organizational technical support (overhead) should be: 


DEVELOPMENT FACTOR = $\frac{CRD + CRD_I}{CRD} = \frac{19 + 8}{19} = 1.4$ 

VERIFICATION FACTOR = $\frac{CRV + CRV_I}{CRV} = 1.4$, with $CRV = 14$ 


The buffer management factor then is 1.4 for both development and verification. 

The ground rules for buffer management can now be stated as: 

o Begin with defined CR buffers in the verification support and 
  L6 verification allocations. 

o Adjust the buffers to account for the final build in each 
  operational increment which is left open for DR corrections. 

o On a continuing basis account for change by: 

    - Adjusting both buffers by the direct CR estimates marked 
      up by the buffer management factor. 

    - Adjusting for actuals overruns and underruns if required. 
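
A minimal sketch of the continuing adjustment is given below. It assumes that each late CR debits the buffer by its direct estimate times the 1.4 factor, and that overruns reduce the buffer while underruns restore it; the paper gives the ground rules but not this bookkeeping, so the sign conventions and names are assumptions.

```python
BUFFER_FACTOR = 1.4  # buffer management factor for development and verification


def adjust_buffer(buffer_mm, direct_cr_estimate_mm, actual_variance_mm=0.0):
    """Return the buffer remaining after charging one late CR.

    direct_cr_estimate_mm is marked up by the buffer management factor.
    actual_variance_mm is positive for an overrun and negative for an underrun.
    """
    return buffer_mm - BUFFER_FACTOR * direct_cr_estimate_mm - actual_variance_mm


# Hypothetical buffer of 50 manmonths charged with two late CR's.
remaining = adjust_buffer(50.0, 10.0)
remaining = adjust_buffer(remaining, 5.0, actual_variance_mm=1.5)
print(round(remaining, 1))
```

Applying the same function to the verification buffer keeps the two buffers in step, as the ground rules require.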
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Model Tracking 

An initial model is good only for a first projection and allocation of 
manpower. In order to make the model usable on a continuing basis, 
actual data relating to the projected data should be tracked and used to 
validate and/or modify the model. For this model two types of data 
tracking are required. 

The first requirement is to confirm that the projected quality 
is attained. The most appropriate measure of quality appears to be DR's 
per manmonth of development. The DR count is the number of DR's which 
require fixes written against the builds contained in the increment or 
release. The manpower number is the total manpower expended during 
capabilities development and verification support. This measurement is 
initiated at the end of the first capabilities development phase and 
terminates at CI for a given release. As shown in Figure 13, an alert 
line has been established from Release 17/18 experience. If the measurement 
violates the alert line, then an effort will be made to determine whether the 
initial quality is not what had been projected. The maximum alert line 
is 75% of the Release 17/18 Fix DR per manmonth of development number 
found in Figure 5. A graph of this measurement is reviewed in the 
project and with the customer on a periodic basis. 

The second type of data tracking required is that expenditures by major 
function must be examined on a release basis and the data used to tune the 
model. To assist in this task a form has been developed to allow for the 
recording of schedule, manpower projections, buffer management, actuals and 
quality tracking data for a given release. This form is presented in 
Figure 14. Since it contains scheduling, tracking, and completion data, it 
could be useful as a release management tool as well as a software cost 
engineering tool. 
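
The quality check can be sketched as a simple comparison against the alert line. The baseline Fix DR rate comes from Figure 5, which is not available here, so the baseline used in the usage lines is a hypothetical placeholder, as are the function names.

```python
ALERT_FRACTION = 0.75  # alert line is 75% of the Release 17/18 baseline rate


def alert_line(baseline_fix_dr_per_mm):
    """Maximum acceptable Fix DR's per manmonth of development."""
    return ALERT_FRACTION * baseline_fix_dr_per_mm


def violates_alert(fix_drs, dev_manmonths, baseline_fix_dr_per_mm):
    """True when the measured Fix DR rate exceeds the alert line."""
    return fix_drs / dev_manmonths > alert_line(baseline_fix_dr_per_mm)


# Hypothetical baseline of 0.5 Fix DR's per manmonth (not the Figure 5 value).
print(violates_alert(fix_drs=40, dev_manmonths=100, baseline_fix_dr_per_mm=0.5))
print(violates_alert(fix_drs=30, dev_manmonths=100, baseline_fix_dr_per_mm=0.5))
```

The manpower figure passed in should cover capabilities development and verification support, matching the measurement window described above.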
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Figure 14. Release Tracking Form 
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Model Sensitivities 


It is important to point out that the generalized model is sensitive to 
the direct CR estimates. The accuracy of this basic building block of 
most cost models is important to any project which has a significant amount 
of ongoing change. Also, since the quality model emphasizes early development, 
the increased impact due to enhanced design and code reviews, 
requirements analysis and pre-build testing needs to be included in the direct 
CR costs. The model is also sensitive to the number and cost of the DRs 
generated during development and test of a release. The tracking of release 
quality will keep a proper project focus on this sensitivity. 


Summary 

Since the total cost of a software development project is directly related 
to the initial quality built into the software, it becomes a necessity to 
project manpower to attain that quality. The basic approach takes a current 
manpower model and, by reflecting greater expenditure on elements which 
are known to improve initial quality, generates a new manpower model 
designed to generate quality software. Since the model must be used in an 
ongoing incremental release environment, an allocation model is developed 
to allocate manpower across a project's organization. Finally, a procedure 
is developed to allow the tracking of data for model validation and 
modification. 



PURPOSE 


• THE INCREMENTAL RELEASE PLAN IS A PROJECT PLANNING 
  METHODOLOGY WHICH WILL RESULT IN HIGHER QUALITY 
  FLIGHT SOFTWARE RELEASES: 

  - SMALLER, MORE FREQUENT DEVELOPMENT INCREMENTS 

  - MORE COMPREHENSIVE TESTING PRIOR TO FIELD 
    DELIVERY 


• MANPOWER IS BEING REPHASED ON THE PROJECT TO PLACE MORE 
  EMPHASIS ON REQUIREMENTS ANALYSIS, DESIGN/CODE REVIEWS 
  AND PRE-BUILD TEST 


• THE PURPOSE OF THIS STUDY IS TO MODIFY THE CURRENT 
  MANPOWER MODEL TO REFLECT THESE CHANGES AND PROVIDE A USABLE 
  ONGOING MODEL FOR CLASS 1 WORK IN THE FUTURE 



IMM SPACE SHUTTLE PROGRAMS 



APPROACH 


• START WITH CURRENT MANPOWER MODEL 

  - 1 MW DEVELOPMENT ESTIMATE 

  - ROM = 2.25(2.5(1.8(1 MW))) = 5.6(1.8(1 MW)) 

        = 10.1 MW 


• BEGINNING WITH THE BASIC ASSUMPTION THAT IMPROVING QUALITY 
  DOES NOT COST MORE, WE DEVELOP A MODEL WHICH 

  - ADDS 40% TO DEVELOPMENT FOR BETTER REQUIREMENTS 
    ANALYSIS AND BETTER DESIGN AND CODE REVIEWS 

  - THIS IS EQUIVALENT TO ADDING 1 DAY FOR R.A. AND 
    1 DAY FOR REVIEWS 

  - THIS WILL RESULT IN FEWER DRs SO EVEN THOUGH THE 
    DEVELOPMENT COST IS UP THE TOTAL COST IS THE SAME 

  - THE DEVELOPMENT COST WOULD BE 1.4 MW WHILE THE 
    DIRECT L6 COST IS THE SAME (.8) 

  - ROM = 2.0(1.15(2.0(2.2 MW))) = 4.6(2.2 MW) = 10.1 MW 


• THIS IS A CONSERVATIVE APPROACH FOR THE FIRST MODEL WHICH 
CAN BE MODIFIED BASED ON ACTUAL DATA 

• THIS MODEL IS INTENDED FOR USE WHEN THE INCREMENTAL RELEASE 
STRATEGY REACHES STEADY STATE 
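
The arithmetic on this slide can be checked with a short sketch. It only multiplies out the nested factors as printed; the slide does not name the individual multipliers, so the function name `rom` and the grouping of the base figure as "development plus .8 direct cost" are assumptions made for illustration.

```python
from functools import reduce


def rom(base_mw, multipliers):
    """Multiply a base man-week (MW) figure by a chain of nested multipliers."""
    return reduce(lambda total, m: total * m, multipliers, base_mw)


# Current model: a 1 MW development estimate plus the .8 direct cost gives
# 1.8 MW, which is then scaled by 2.5 and 2.25 (meanings not given on the slide).
current = rom(1.0 + 0.8, [2.5, 2.25])

# Revised model: development rises 40% (1.0 -> 1.4 MW), the .8 direct cost
# is unchanged, and the outer multipliers become 2.0, 1.15 and 2.0.
revised = rom(1.4 + 0.8, [2.0, 1.15, 2.0])

print(round(current, 1), round(revised, 1))
```

Both totals come out at about 10.1 MW, which is the slide's point: the extra 40% spent in development is offset by smaller downstream multipliers.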



RELEASE  ITEM       CU     MM    DRs   MAINT   DR/CU   MM/CU   MM/DR   MAINT/DEV   DR/MM DEV 

R16      DEV         -   2183      -     817       -     1.3      .2          .2           - 
         VERIF       -   1718      -     618       -     1.0      .1          .2           - 
         FIX         -      -   2371       -     1.4       -       -           -         1.1 
         NO FIX      -      -   2289       -     1.3       -       -           -         1.1 
         TOTAL    1725   3901   4660    1435     2.7     2.3      .3          .4         2.2 

R17/18   DEV         -   1286      -     381       -     1.6      .2          .2           - 
         VERIF       -    972      -       3       -     1.2       0           0           - 
         FIX         -      -    951       -     1.2       -       -           -          .7 
         NO FIX      -      -   1449       -     1.9       -       -           -         1.1 
         TOTAL     782   2258   2440     384     3.1     2.8      .2          .2         2.0 







R16 AND R17/18 DR ANALYSIS 



• The maintenance manpower includes technical support for both 
  Development and Verification. 

• Taking that support out of R16 we get 1435/2.25 = 638 mm 


• Knowing that we spend, on the average, 5 times as much effort on fix 
  DRs as no fix DRs we can write: 

      638 mm = 2371X + 2289(X/5) 

        3190 = 11855X + 2289X 

        3190 = 14144X 

           X = .23 mm 

             = 4.50 md 

  or 2.25 md/Fix DR for each of Development and 
  Verification 


• Performing the same analysis for R17/R18 (for Development only, since 
  Verification maintenance ≈ 0 because CI is so close to flight) we get: 

      2.66 md/Fix DR Development 

• If we average these numbers we arrive at: 

  The direct impact of a DR which is fixed is: 

      2.5 md for Development 
      2.5 md for Verification 

  The direct impact of a DR which is not fixed is: 

      .5 md for Development 
      .5 md for Verification 
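
The R16 derivation above can be reproduced with a minimal sketch. The function and parameter names are hypothetical. The conversion of 20 man-days per man-month is not stated on the slide; it is inferred from the slide's own figures (about .225 mm shown as 4.50 md) and should be treated as an assumption.

```python
def effort_per_fix_dr(maint_mm, support_factor, fix_drs, nofix_drs,
                      fix_to_nofix_ratio=5.0):
    """Solve direct_mm = fix_drs*X + nofix_drs*(X/ratio) for X (mm per fix DR)."""
    direct_mm = maint_mm / support_factor  # remove technical support effort
    return direct_mm / (fix_drs + nofix_drs / fix_to_nofix_ratio)


MD_PER_MM = 20  # assumed man-days per man-month, implied by the slide's numbers

# R16 figures from the slide: 1435 mm maintenance, 2371 fix DRs, 2289 no-fix DRs.
x_mm = effort_per_fix_dr(1435, 2.25, 2371, 2289)
x_md = x_mm * MD_PER_MM

print(f"{x_mm:.2f} mm per fix DR, about {x_md:.1f} md")
print(f"{x_md / 2:.2f} md each for Development and Verification")
```

A no-fix DR then costs one fifth of this figure under the same ratio, which is consistent with the .5 md values quoted on the slide.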



DR PREVENTION 


• DRs ARE WRITTEN FOR A PROBLEM ONLY AFTER THE SOFTWARE 
  CAUSING THE PROBLEM HAS BEEN PLACED ON THE MASTER SYSTEM 

• ONCE A DR IS WRITTEN, ALL AREAS OF THE PROJECT BECOME 
  INVOLVED IN ITS CLOSURE REGARDLESS OF WHETHER IT IS A 
  PROBLEM OR NOT 

• HENCE, THERE ARE TWO POSSIBILITIES FOR REDUCING THE NUMBER 
  OF DRs: 

  - ENHANCE THE REQUIREMENTS ANALYSIS ACTIVITY 
    TO GIVE A POINT OF COORDINATION BEFORE THE 
    DR IS WRITTEN 

  - ENHANCE THE DEVELOPMENT PROCESS PRIOR TO THE 
    BUILD FOR THE MASTER SYSTEM 

• ENHANCE REQUIREMENTS ANALYSIS 

  - IMPROVE QUALITY OF CR BEFORE 
    IMPLEMENTATION BEGINS 

  - COORDINATE CRs 

  - SPECIFY L1/L2 TESTS 

  - REVIEW TEST RESULTS 

  - SUPPORT D&C REVIEWS 

• ENHANCE DESIGN & CODE REVIEWS 

  - SPEND MORE TIME ON REVIEW 

  - IMPROVE CHECKLISTS, DOCUMENTATION 

  - IMPROVED/DEDICATED MODERATORS 

  - UNDER INVOLVEMENT ON PROJECT 

SOFTWARE QUALITY TRACKING 


• MOST APPROPRIATE MEASURE IS DRs PER MAN MONTH 
  - BASE MANPOWER NUMBER IS TOTAL EXPENDED DURING 
    CAPABILITIES DEVELOPMENT AND CAPABILITIES REFINEMENT 
  - DR COUNT WILL BE THE DRs WHICH REQUIRE FIXES WRITTEN 
    AGAINST THE BUILDS CONTAINED IN THE INCREMENT 


• INITIATE MEASUREMENT AT THE END OF FIRST CAPABILITIES 
  DEVELOPMENT PHASE. TERMINATE AT CI 
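
As a rough illustration of the tracking measure, the sketch below computes fix DRs per man-month and compares it with an alert line set at 75% of a baseline rate, as the paper describes for Release 17/18 experience. The function names, the sample counts, and the baseline value of 0.7 are illustrative assumptions; the paper takes the actual baseline from its Figure 5, and treating "violates" as "exceeds" is also an assumption.

```python
def fix_dr_rate(fix_drs, man_months):
    """Fix DRs written against the increment per man-month expended."""
    return fix_drs / man_months


def alert_line(baseline_rate, fraction=0.75):
    """Alert threshold as a fraction of the baseline fix-DR rate."""
    return fraction * baseline_rate


def violates_alert(fix_drs, man_months, baseline_rate, fraction=0.75):
    """True when the measured rate exceeds the alert line (assumed direction)."""
    return fix_dr_rate(fix_drs, man_months) > alert_line(baseline_rate, fraction)


BASELINE = 0.7  # illustrative baseline rate in fix DRs per man-month

print(alert_line(BASELINE))                 # threshold for the release
print(violates_alert(120, 200, BASELINE))   # rate 0.60, above the line
print(violates_alert(90, 200, BASELINE))    # rate 0.45, below the line
```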
1.0 INTRODUCTION 


This paper summarizes a current Data & Analysis Center for Software (DACS) 
effort to develop software baselines. This baseline effort is an on-going 
activity; that is, the baselines are meant to be updated as new software data 
becomes available. The information presented and processed has been organized to 
make periodic updating a much simpler task. 


A baseline, for this effort, will consist of an estimation of any 
characteristic of a software project that may be helpful to a developer, manager, 
or monitor to manage, control, or influence a software product. The objective of 
these baselines is to provide a tool for aiding software developers in their 
daily work. Baselines have been synthesized from an empirical dataset provided 
by the Software Engineering Laboratory at NASA Goddard Space Flight Center 
(NASA/SEL). These data have been selected because the data collection effort 
developed at the NASA/SEL is the most thorough and complete available to us. The 
characteristics of the NASA/SEL environment may not be common to most or all 
users. Therefore, the user is advised to calibrate our baselines with his 
professional judgement and experience to provide for the possible differences 
between his and the NASA/SEL environments. 


The baseline effort, defined as an on-going activity, has been broken down 
into several phases. The motive for the division of the baseline effort into 
successive phases is twofold. The first motive is the desire to provide the 
practitioner with the most current information. Waiting until all variables have 
been analyzed to release the package incurs the risk of providing very outdated 
baselines. Second, and more important for the future development of this effort, 
the plan for producing the baselines may be subject to modifications. The 
practitioner may require different or additional information, or the same 
information presented in another form. Therefore, any comments and suggestions 
to adapt, modify or change the present baselines in order to improve their 
practical use are not only welcome but are considered to be an integral part of 
this effort. 
2.0 METHODOLOGY 


The importance of breaking down software projects into smaller and more 
homogeneous subgroups was an insight gained from previous analysis tasks. 
Project heterogeneity was caused by the presence of very different factors which 
were not possible to isolate in different software projects. A solution was 
provided by breaking down the set of software projects into more homogeneous 
subgroups. 

We have selected the current version (February 1983) of the NASA/SEL dataset 
as the empirical base from which to develop software baselines. This set 
contains the latest version of the comprehensive and thorough data collection 
effort performed by the NASA/SEL staff. It exhibits two interesting features. 
First, the data was collected at both the project and module or component levels. 
Second, these components are classified according to their function (See Table 
I). This classification will allow us to characterize each component by the 
function it performs within the project. These module functions also provide the 
scheme for breaking the data into homogeneous subgroups. 

TABLE I: DESCRIPTION OF THE MODULE/COMPONENT FUNCTIONS ANALYZED 
DURING THE PRESENT EFFORT 

NASA/SEL Encoding Dictionary 

Code   Module/Component Name   Function 

  1    Include                 Include Statements 
  2    Control                 Control Statements (JCL, Overlay) 
  3    System                  System Statements (ALC) 
  4    Gess                    Graphics Statements (GESS) 
  5    Data                    Data Statements 
  7    CDR                     FORTRAN Control/Driver Module 
  8    C COMP                  FORTRAN Control/Computational Statements 
  9    DTRANS                  FORTRAN Data Transfer Module 
 10    IO                      FORTRAN Input/Output Module 
 17    IOCDR                   FORTRAN Control/Driver Module with I/O 
 18    IOCCOMP                 FORTRAN Control/Computational Module with I/O 
 19    IODTRANS                FORTRAN Data Transfer Module with I/O 
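
The encoding dictionary in Table I lends itself to a simple lookup that splits component records into the homogeneous subgroups the methodology calls for. The sketch below is illustrative only: the record layout and the field name `module_function` are assumptions, not part of the NASA/SEL dataset definition.

```python
from collections import defaultdict

# Module function codes and short names as listed in Table I.
MODULE_FUNCTIONS = {
    1: "Include", 2: "Control", 3: "System", 4: "Gess", 5: "Data",
    7: "CDR", 8: "C COMP", 9: "DTRANS", 10: "IO",
    17: "IOCDR", 18: "IOCCOMP", 19: "IODTRANS",
}


def group_by_function(components):
    """Group component records by the name of their module function code."""
    groups = defaultdict(list)
    for component in components:
        name = MODULE_FUNCTIONS[component["module_function"]]
        groups[name].append(component)
    return dict(groups)


# Hypothetical records: (project code, component code, module function code).
sample = [
    {"project": 1, "component": 101, "module_function": 7},
    {"project": 1, "component": 102, "module_function": 17},
    {"project": 2, "component": 201, "module_function": 7},
]
groups = group_by_function(sample)
print({name: len(items) for name, items in groups.items()})
```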

It is useful to perform a simultaneous analysis as a triple function of data 
analysis (baseline generation), data quality assurance, and research where a 
large dataset such as the NASA/SEL is being studied. The first, baseline 
generation, is the primary objective of this effort. The baselines are designed 
to statistically address the following question: 


What does a module or component of a given function look like? In other 
words, how can I describe a "typical" module that performs a given 
function in terms of its size, effort, runs necessary to develop this 
module, origin, complexity and type of specification? How does this 
situation vary, if any, from the moment this module is given to a 
programmer until the moment this module is ready (i.e., from the NEW to 
the COMPLETED stage)? 


The second, quality assurance, is inherent to statistical analysis. A 
statistician carefully logs in his quality assurance notebook all 
inconsistencies observed during the process of data reduction and analysis. The analysis 
of these isolated inconsistencies provides insight into the process being studied 
and/or improvements to the data collection process. These insights often prove 
useful for both the data collector and the analyst in future efforts. Finally, 
the research function in the baseline generation follows a sequence of 
activities: 


(1) look at large numbers of software components or modules grouped by common 
function to try to isolate the similarities and differences stemming 
from this grouping 

(2) try to determine, given that we are dealing with empirical data, whether 
these similar and different behavioral patterns are arising by chance 

(3) if (2) is not true, to determine if there is sufficient statistical proof 
to state that these patterns are an inherent characteristic of these 
groupings of modules/components 


This type of information is useful to both the theoretical software engineering 
researcher and the active practitioner, the software developer. It may be possible 
in future efforts to uncover a correlation that enables the practitioner to 
obtain one element (module/component) from another or to monitor one element 
while fixing the other, once it can be established that a relationship exists 
between two integral elements of the dataset. 

In addition, certain relationships are known to hold from theory or 
experience. If it is found that they also hold in the data, this provides an 
indicator of the quality of the information; if they do not, it may mean either 
that the quality of the data is suspect or that there are some special 
characteristics of this situation that deserve further investigation. This 
indicator becomes a useful working tool. This is the framework in which the 
present research effort has been conducted. Baseline results should follow this 
line of thought. 

We have worked exclusively with the variables shown in Table II in the 
current phase of the baselines task. 


TABLE II: DATA DEFINITION FOR VARIABLES USED IN PHASE I 

      Variable                        Format 

  1.  Project-code                    I(2) 
  2.  Component-code                  I(3) 
  3.  Module-function                 I(2) 
  4.  System/subsystem                I(2) 
  5.  Origin                          I(2) 
  6.  Precision of specifications     I(2) 
  7.  Complexity                      A(2) 
  8.  Num-comp-called                 I(2) 
  9.  Num-calling-components          I(2) 
 10.  Primary-language                I(2) 
 11.  % of Primary                    I(3) 
 12.  Secondary-language              I(2) 
 13.  % of Secondary                  I(3) 
 14.  Total-runs                      I(4) 
 15.  Total-time                      I(4) 
 16.  Total-effort                    I(4) 
 17.  Total-source-for-components     I(8) 
 18.  Development status              A(2) 


There were several reasons for selecting these 18 variables from all of the 
possible variables existing in the NASA/SEL dataset. The main thrust of the 
current phase of the baselines task is the characterization of project components 
by some type of useful grouping. Project code combined with component code (1 
and 2) provides identification for each module/component, and the module 
function (3) provides the required subgrouping to produce more homogeneous 
subsets. The variable System/subsystem (4) was not used in this phase of the 
baselines task. The qualitative variables origin (5), precision of the 
specifications (6), and perceived complexity (7) were selected to assess 
whether the subjective appraisal of a module/component by a programmer primarily 
reflects the module's characteristics or rather the programmer's 
characteristics. The number of components called (8) and the number of calling 
components (9) were selected as indicators of the module's complexity with 
respect to its interface within the whole system structure. 


Data on the variables (8) and (9) was not available across all 
twelve module functions defined in Table I in quantities large enough to support 
analysis. These variables, therefore, were analyzed only in the last four, and 
most numerous, module function groups. 

It was observed that each component was written in a single language and that 
only two languages, FORTRAN and Assembler, were utilized throughout the dataset. 
The variables total runs (14), total time (15), total effort (16) and total 
source for components (17) refer to computer runs, time spent by a programmer in 
computer work, programming effort and module size respectively. These last 4 
variables will provide quantitative characterization baselines in subsequent 
phases of this effort. 


The variable development status (18) provides two qualitative classes: New 
and Completed. This variable provides a key analysis tool since it is possible 
to compare the state of the module, represented by all of the above variables, 
before and after its implementation. This type of analysis will yield a valuable 
management tool since it will allow assessing the accuracy of the estimates 
provided by the programmers at the beginning of their tasks. 
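
One way to picture the NEW versus COMPLETED comparison is as a relative error between the estimate recorded at the NEW stage and the actual value at the COMPLETED stage. The sketch below is an illustration under assumptions: the field names and sample values are hypothetical, and the paper does not state which accuracy measure will be used.

```python
def estimate_error(new, completed):
    """Relative error of each NEW-stage estimate against the COMPLETED actual."""
    errors = {}
    for name, actual in completed.items():
        # Skip variables with no NEW-stage estimate or a zero actual value.
        if name in new and actual:
            errors[name] = (new[name] - actual) / actual
    return errors


# Hypothetical records for one module at the two development statuses.
new_record = {"total_runs": 10, "total_effort": 40, "total_source": 200}
completed_record = {"total_runs": 16, "total_effort": 50, "total_source": 250}

errors = estimate_error(new_record, completed_record)
for name, err in errors.items():
    print(f"{name}: {err:+.1%}")
```

A negative value means the programmer's initial estimate was below the actual figure.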


Preliminary results are presented in Tables IIIA, IIIB, IVA, and IVB as 
examples of the type of output obtained from these analyses by module functions. 

TABLE IIIA: CORRELATION SIZES/SIGNIFICANCES 

NEW (PRE-DEVELOPMENT ESTIMATES) 

MODULE      SIZE VS   SIZE VS   SIZE VS   EFFORT VS   EFFORT VS   TIME VS 
FUNCTION    EFFORT    TIME      RUNS      TIME        RUNS        RUNS 

Include     26/***    25/*      25/NS     25/**       25/**       25/** 
Control     N/A       N/A       N/A       N/A         N/A         N/A 
System      10/NS     9/NS      8/NS      9/*         8/NS        7/NS 
Gess        85/***    84/***    84/***    88/***      88/***      88/*** 
Data        135/***   108/***   114/***   112/***     118/***     112/*** 
CDR         48/***    46/***    46/***    47/***      47/**       46/** 
C COMP      28/**     22/**     26/NS     24/***      28/**       24/*** 
DTRANS      53/**     40/NS     43/**     42/***      45/*        42/** 
IO          129/***   110/***   119/***   111/***     121/***     111/*** 
IOCDR       255/***   206/***   212/***   209/***     215/***     208/*** 
IOCCOMP     189/***   160/***   163/***   163/***     166/***     163/*** 
IODTRANS    128/***   101/***   103/***   105/***     107/***     105/*** 


Legend:  N/A   sufficient data not available to support test 
         NS    non-significant 
         *     significant at level α = 0.05 
         **    significant at level α = 0.01 
         ***   significant at level α = 0.001 


Footnote to Table IIIA 


1  This table provides in each cell the number of pairs used in the 
   estimation of the t correlation coefficient and the significance 
   level attained by this coefficient. For example, "26/***" in the 
   first cell indicates that the t correlation coefficient was 
   computed for 26 "Include" modules and that effort and 
   size were correlated at the .001 level of significance. The table is 
   useful for 

   i) directing the analyst in successive phases of this effort 

   ii) evaluating the quality of the data 

   iii) proposing new and assessing old research questions 
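
To make the cell notation concrete, the sketch below builds a "pairs/significance" entry of the kind shown in Table IIIA. The source does not make clear which correlation coefficient was used, so Kendall's rank correlation with a normal approximation (no tie correction) is swapped in here purely as an assumption. The `min_pairs` cutoff for reporting N/A is also hypothetical, since the paper does not state its threshold.

```python
import math
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a and a two-sided p-value from the normal approximation."""
    n = len(xs)
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        score += (product > 0) - (product < 0)  # concordant minus discordant
    tau = score / (n * (n - 1) / 2)
    sigma = math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    p_value = math.erfc(abs(tau / sigma) / math.sqrt(2))
    return tau, p_value


def stars(p_value):
    """Map a p-value onto the legend used in the tables."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "NS"


def table_cell(xs, ys, min_pairs=5):
    """Format one table cell; min_pairs is a hypothetical N/A cutoff."""
    if len(xs) < min_pairs:
        return "N/A"
    _, p_value = kendall_tau(xs, ys)
    return f"{len(xs)}/{stars(p_value)}"


size = list(range(1, 11))
effort = [2 * s for s in size]       # perfectly monotone with size
print(table_cell(size, effort))
```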

TABLE IIIB: CORRELATION SIZES/SIGNIFICANCES 

COMPLETE (ACTUAL) 

MODULE      SIZE VS   SIZE VS   SIZE VS   EFFORT VS   EFFORT VS   TIME VS 
FUNCTION    EFFORT    TIME      RUNS      TIME        RUNS        RUNS 

Include     25/NS     25/*      25/**     25/***      25/***      25/*** 
Control     N/A       N/A       N/A       N/A         N/A         N/A 
System      7/NS      N/A       N/A       N/A         N/A         N/A 
Gess        73/***    31/**     32/**     31/***      32/***      31/*** 
Data        142/***   86/***    87/***    86/**       87/***      86/*** 
CDR         51/***    46/*      46/**     46/***      46/***      46/*** 
C COMP      36/***    27/NS     27/*      27/*        27/**       27/*** 
DTRANS      60/NS     39/NS     39/**     39/***      39/***      39/*** 
IO          119/***   84/***    86/***    84/***      86/***      84/*** 
IOCDR       254/***   197/***   199/***   198/***     200/***     201/*** 
IOCCOMP     180/***   135/***   137/***   136/***     138/***     136/*** 
IODTRANS    124/**    89/***    92/**     89/***      92/***      89/*** 


Legend:  N/A   sufficient data not available to support test 
         NS    non-significant 
         *     significant at level α = 0.05 
         **    significant at level α = 0.01 
         ***   significant at level α = 0.001 


Footnote to Table IIIB 


   This table provides in each cell the number of pairs used in the 
   estimation of the t correlation coefficient and the significance 
   level attained by this coefficient. The interpretation of this 
   table is the same as for Table IIIA. The table is useful for 

   i) directing the analyst in successive phases of this effort 

   ii) evaluating the quality of the data 

   iii) proposing new and assessing old research questions 

TABLE IVA: CONTINGENCY TABLE RESULTS 

COMPLEXITY VS CODE ORIGIN 

MODULE FUNCTION    NEW    COMPLETE 

Include            N/A    N/A 
Control            N/A    N/A 
System             **     NS 
Data               NS     NS 
CDR                NS     NS 
C COMP             NS     NS 
DTRANS             NS     NS 
IO                 ***    NS 
IOCDR              NS     * 
IOCCOMP            NS     NS 
IODTRANS           NS     NS 


Legend:  N/A   sufficient data not available to support test 
         NS    non-significant at level α = 0.05 
         *     significant at level α = 0.05 
         **    significant at level α = 0.01 
         ***   significant at level α = 0.001 


Footnote to Table IVA 


   The degree of association between the two qualitative 
   variables complexity and code origin was established 
   through contingency tables at the NEW and COMPLETED 
   development phases. The results are tabulated here and 
   may provide 

   i) an evaluation of the quality of the data 
   ii) new research questions 

TABLE IVB: CONTINGENCY TABLE RESULTS 

COMPLEXITY VS PRECISION OF SPECIFICATION 

MODULE FUNCTION    NEW    COMPLETE 

Include            N/A    N/A 
Control            N/A    N/A 
System             * 
Data               ***    *** 
CDR                NS     NS 
C COMP             *      NS 
DTRANS             **     NS 
IO                        *** 
IOCDR              ** 
IOCCOMP            NS     * 
IODTRANS           NS     *** 


Legend:  N/A   sufficient data not available to support test 
         NS    non-significant at level α = 0.05 
         *     significant at level α = 0.05 
         **    significant at level α = 0.01 
         ***   significant at level α = 0.001 


Footnote to Table IVB 

1  The degree of association between the two qualitative 
   variables complexity and precision of specification was established 
   through contingency tables at the NEW and COMPLETED 
   development phases. The results are tabulated here and 
   may provide: 

   i) an evaluation of the quality of the data 
   ii) new research questions 
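
The paper does not say which test was applied to the contingency tables. A common choice for two qualitative variables is Pearson's chi-square test of independence, sketched below as an assumption. The function names and the sample counts are hypothetical; the critical values are standard chi-square quantiles for 1 to 4 degrees of freedom, and the sketch assumes every row and column total is positive.

```python
# Chi-square critical values at alpha = 0.05, 0.01, 0.001 by degrees of freedom.
CRITICAL = {
    1: (3.841, 6.635, 10.828),
    2: (5.991, 9.210, 13.816),
    3: (7.815, 11.345, 16.266),
    4: (9.488, 13.277, 18.467),
}


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


def significance(table):
    """Return NS, *, ** or *** following the legend of Tables IVA and IVB."""
    statistic, dof = chi_square(table)
    c05, c01, c001 = CRITICAL[dof]
    if statistic > c001:
        return "***"
    if statistic > c01:
        return "**"
    if statistic > c05:
        return "*"
    return "NS"


# Hypothetical counts: rows = complexity (easy, hard), columns = origin (new, reused).
counts = [[10, 20], [20, 10]]
print(chi_square(counts), significance(counts))
```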

CONCLUSIONS 

The baselines effort is an on-going activity. It has barely started and some 
elementary baselines will be generated to include a subset of the variables that 
characterize a module from a functional perspective. The next phase in this 
effort will contemplate completing this characterization process by looking at 
other variables and exploiting some of the functional relations that have been 
explored and established during the present phase. 

It will become necessary to begin a study of the performance measures of 
these same modules after the characterization process is sufficiently explored. 
This activity will include the study of productivity and the process of changes 
(both error correction and enhancements). Eventually, this will lead to the 
study of different methodologies and other production factors and their 
influence on the behavior of the above-mentioned performance measures and other 
measures suggested by this research. 